Abstract
This paper describes our system for the NLPCC 2015 shared task on Chinese word segmentation (WS) and POS tagging for Weibo text. We treat WS and POS tagging as two separate tasks and use a cascaded approach. Our major focus is how to effectively exploit multiple heterogeneous datasets to boost the performance of statistical models. This work considers three heterogeneous datasets: Weibo (\(\textit{WB}\), 10K sentences), Penn Chinese Treebank 7.0 (\(\textit{CTB7}\), 50K), and People’s Daily (\(\textit{PD}\), 280K). For WS, we adopt the recently proposed coupled sequence labeling to combine \(\textit{WB}\), \(\textit{CTB7}\), and \(\textit{PD}\), boosting the F1 score from \(93.76\%\) (baseline model trained on \(\textit{WB}\) only) to \(95.58\%\) (\(+1.82\%\)). For POS tagging, we adopt an ensemble approach that combines coupled sequence labeling with the guide-feature based method, since the three datasets follow three different annotation standards. First, we convert \(\textit{PD}\) into the annotation style of \(\textit{CTB7}\) using coupled sequence labeling; the converted data is denoted by \(\textit{PD}^{\textit{CTB}}\). Then, we merge \(\textit{CTB7}\) and \(\textit{PD}^{\textit{CTB}}\) to train a POS tagger, denoted by \(\textit{Tag}_{\textit{CTB7}+\textit{PD}^{\textit{CTB}}}\), which is in turn used to produce guide features on \(\textit{WB}\). Overall, the tagging F1 score improves from \(87.93\%\) to \(88.99\%\) (\(+1.06\%\)).
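The guide-feature step can be pictured with a minimal sketch. It assumes a source tagger (standing in for \(\textit{Tag}_{\textit{CTB7}+\textit{PD}^{\textit{CTB}}}\)) that returns one tag per word, and it appends those tags as extra features for a tagger trained on \(\textit{WB}\). The function names, the toy lexicon, and the feature templates are all hypothetical illustrations and are not taken from the system described here.

```python
from typing import Callable, List, Sequence


def add_guide_features(
    words: Sequence[str],
    source_tagger: Callable[[Sequence[str]], List[str]],
) -> List[List[str]]:
    """Return one feature list per word, augmented with guide features.

    `source_tagger` is an assumed stand-in for a tagger trained on
    source-style data; its output tags are used only as features.
    """
    guide_tags = source_tagger(words)
    if len(guide_tags) != len(words):
        raise ValueError("source tagger must return one tag per word")

    features: List[List[str]] = []
    for i, word in enumerate(words):
        # Guide tag of the previous word, or a boundary symbol at position 0.
        prev_tag = guide_tags[i - 1] if i > 0 else "<BOS>"
        features.append([
            f"w={word}",                                  # baseline word feature
            f"guide={guide_tags[i]}",                     # guide tag of this word
            f"guide_bigram={prev_tag}_{guide_tags[i]}",   # previous + current guide tag
            f"w_guide={word}_{guide_tags[i]}",            # word combined with guide tag
        ])
    return features


def toy_source_tagger(words: Sequence[str]) -> List[str]:
    """Hypothetical lexicon lookup used only to make the sketch runnable."""
    lexicon = {"我": "PN", "爱": "VV", "微博": "NN"}
    return [lexicon.get(w, "NN") for w in words]


if __name__ == "__main__":
    for feats in add_guide_features(["我", "爱", "微博"], toy_source_tagger):
        print(feats)
```

In a setup like this, the target tagger is free to learn how far to trust each guide tag, which is why differing annotation standards between the source and target data need not be reconciled by hand.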
This work was supported by the National Natural Science Foundation of China (Grant Nos. 61432013 and 61273319) and the Jiangsu Planned Projects for Postdoctoral Research Funds (No. 1401075B).
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Chao, J., Li, Z., Chen, W., Zhang, M. (2015). Exploiting Heterogeneous Annotations for Weibo Word Segmentation and POS Tagging. In: Li, J., Ji, H., Zhao, D., Feng, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2015. Lecture Notes in Computer Science, vol 9362. Springer, Cham. https://doi.org/10.1007/978-3-319-25207-0_46
DOI: https://doi.org/10.1007/978-3-319-25207-0_46
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25206-3
Online ISBN: 978-3-319-25207-0
eBook Packages: Computer Science, Computer Science (R0)