The Specification of POS Tagging of the Hong Kong University Cantonese Corpus

Wong Ping-Wai
Copyright: © 2006 | Volume: 2 | Issue: 1 | Pages: 18
ISSN: 1548-3908 | EISSN: 1548-3916 | EISBN13: 9781615204366 | DOI: 10.4018/jthi.2006010102
Cite Article

MLA

Wong, Ping-Wai. "The Specification of POS Tagging of the Hong Kong University Cantonese Corpus." IJTHI, vol. 2, no. 1, 2006, pp. 21-38. http://doi.org/10.4018/jthi.2006010102

APA

Wong, P.-W. (2006). The Specification of POS Tagging of the Hong Kong University Cantonese Corpus. International Journal of Technology and Human Interaction (IJTHI), 2(1), 21-38. http://doi.org/10.4018/jthi.2006010102

Chicago

Wong, Ping-Wai. "The Specification of POS Tagging of the Hong Kong University Cantonese Corpus." International Journal of Technology and Human Interaction (IJTHI) 2, no. 1 (2006): 21-38. http://doi.org/10.4018/jthi.2006010102

Abstract

The Hong Kong University Cantonese Corpus (HKUCC) was collected from transcribed spontaneous speech in conversations and radio programs involving two to four people. It was word-segmented, annotated with Cantonese pronunciation, and recently tagged with word classes using the parts-of-speech (POS) scheme of Yu et al. (2002). This scheme was designed for tagging written Mandarin texts and encountered some problems when applied to spoken Cantonese. However, it is flexible: its 26 basic word classes can be expanded with customized subclasses for annotating other Chinese dialects (e.g., Cantonese). Its robustness was demonstrated by the annotation of approximately 230,000 words in the HKUCC. This article describes the format of the corpus and provides a specification that helps annotators with POS tagging and resolves problems encountered in manual annotation. Guidelines for tagging some word classes are introduced, followed by a discussion of easily confused tags, illustrated with examples from the corpus. Further work will aim at automatic annotation in order to facilitate POS tagging of Cantonese and other Chinese dialects. Corpora of Hong Kong Cantonese are scarce. Past work focused either on a POS-tagged corpus of child language or on the phonetic transcription of an adult Cantonese corpus. HKUCC fills the gap by providing a POS-tagged corpus of adult Cantonese and is believed to be of great value to data-driven linguistic analysis and natural language processing for Cantonese.
