Exploiting Heterogeneous Annotations for Weibo Word Segmentation and POS Tagging

Chao, Jiayuan; Li, Zhenghua; Chen, Wenliang; Zhang, Min

doi:10.1007/978-3-319-25207-0_46

Jiayuan Chao²³,
Zhenghua Li²³,
Wenliang Chen²³ &
…
Min Zhang²³

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 9362))

Included in the following conference series:

CCF International Conference on Natural Language Processing and Chinese Computing

2301 Accesses
3 Citations

Abstract

This paper describes our system designed for the NLPCC 2015 shared task on Chinese word segmentation (WS) and POS tagging for Weibo Text. We treat WS and POS tagging as two separate tasks and use a cascaded approach. Our major focus is how to effectively exploit multiple heterogeneous data to boost performance of statistical models. This work considers three sets of heterogeneous data, i.e., Weibo (\(\textit{WB}\), 10K sentences), Penn Chinese Treebank 7.0 (\(\textit{CTB7}\), 50K), and People’s Daily (\(\textit{PD}\), 280K). For WS, we adopt the recently proposed coupled sequence labeling to combine \(\textit{WB}\), \(\textit{CTB7}\), and \(\textit{PD}\), boosting F1 score from \(93.76\%\) (baseline model trained on only \(\textit{WB}\)) to \(95.58\%\) (\(+1.82\%\)). For POS tagging, we adopt an ensemble approach combining coupled sequence labeling and the guide-feature based method, since the three datasets have three different annotation standards. First, we convert \(\textit{PD}\) into the annotation style of \(\textit{CTB7}\) based on coupled sequence labeling, denoted by \(\textit{PD}^{\textit{CTB}}\). Then, we merge CTB7 and \(\textit{PD}^{\textit{CTB}}\) to train a POS tagger, denoted by \(\textit{Tag}_{\textit{CTB7}+\textit{PD}^{\textit{CTB}}}\), which is further used to produce guide features on \(\textit{WB}\). Finally, the tagging F1 score is improved from 87.93% to 88.99% (+1.06%).

This work was supported by National Natural Science Foundation of China (Grant No.61432013, 61273319) and Jiangsu Planned Projects for Post-doctoral Research Funds (No.1401075B).

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Jiang, W., Huang, L., Liu, Q.: Automatic adaptation of annotation standards: Chinese word segmentation and POS tagging - a case study. In: Proceedings of ACL, pp. 522–530 (2009)
Google Scholar
Jiang, W., Huang, L., Liu, Q., Lü, Y.: A cascaded linear model for joint Chinese word segmentation and part-of-speech tagging. In: Proceedings of ACL 2008: HLT, pp. 897–904 (2008)
Google Scholar
Jiang, W., Sun, M., Lü, Y., Yang, Y., Liu, Q.: Discriminative learning with natural annotations: word segmentation as a case study. In: Proceedings of ACL, pp. 761–769 (2013)
Google Scholar
Kruengkrai, C., Uchimoto, K., Kazama, J., Wang, Y., Torisawa, K., Isahara, H.: An error-driven word-character hybrid model for joint Chinese word segmentation and POS tagging. In: Proceedings of ACL-AFNLP 2009, pp. 513–521 (2009)
Google Scholar
Lafferty, J., McCallum, A., Pereira, F.C.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the 18th International Conference on Machine Learning (ICML 2001), pp. 282–289 (2001)
Google Scholar
Li, Z., Chao, J., Zhang, M., Chen, W.: Coupled sequence labeling on heterogeneous annotations: pos tagging as a case study. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, vol. 1: Long Papers, pp. 1783–1792. Association for Computational Linguistics, Beijing july, 2015
Google Scholar
Li, Z., Che, W., Liu, T.: Exploiting multiple treebanks for parsing with quasisynchronous grammar. In: ACL, pp. 675–684 (2012)
Google Scholar
Liu, Y., Zhang, Y., Che, W., Liu, T., Wu, F.: Domain adaptation for CRF-based Chinese word segmentation using free annotations. In: Proceedings of EMNLP, pp. 864–874 (2014)
Google Scholar
Qiu, X., Huang, C., Huang, X.: Automatic corpus expansion for Chinese word segmentation by exploiting the redundancy of web information. In: Proceedings of COLING, pp. 1154–1164 (2014)
Google Scholar
Qiu, X., Qian, P., Huang, X.: Overview of the nlpcc 2015 shared task: chinese word segmentation and pos tagging for micro-blog texts (2015). arXiv preprint arXiv:1505.07599
Qiu, X., Zhao, J., Huang, X.: Joint Chinese word segmentation and POS tagging on heterogeneous annotated corpora with multiple task learning. In: Proceedings of EMNLP, pp. 658–668 (2013)
Google Scholar
Sun, W.: A stacked sub-word model for joint chinese word segmentation and part-of-speech tagging. In: Proceedings of ACL, pp. 1385–1394 (2011)
Google Scholar
Sun, W., Wan, X.: Reducing approximation and estimation errors for Chinese lexical processing with heterogeneous annotations. In: Proceedings of ACL, pp. 232–241 (2012)
Google Scholar
Sun, W., Xu, J.: Enhancing chinese word segmentation using unlabeled data. In: Proceedings of EMNLP, pp. 970–979 (2011)
Google Scholar
Wang, A., Kan, M.Y.: Mining informal language from chinese microtext: joint word recognition and segmentation. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, vol. 1: Long Papers, pp. 731–741. Association for Computational Linguistics, Sofia, August 2013
Google Scholar
Xue, N., et al.: Chinese word segmentation as character tagging. Computational Linguistics and Chinese Language Processing 8(1), 29–48 (2003)
Google Scholar
Yang, F., Vozila, P.: Semi-supervised Chinese word segmentation using partial-label learning with conditional random fields. In: Proceedings of EMNLP, pp. 90–98 (2014)
Google Scholar
Zeng, X., Wong, D.F., Chao, L.S., Trancoso, I.: Graph-based semi-supervised model for joint chinese word segmentation and part-of-speech tagging. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, vol. 1: Long Papers, pp. 770–779. Association for Computational Linguistics, Sofia, August 2013
Google Scholar
Zhang, L., Wang, H., Sun, X., Mansur, M.: Exploring representations from unlabeled data with co-training for Chinese word segmentation. In: Proceedings of EMNLP, pp. 311–321 (2013)
Google Scholar
Zhang, L., Wang, H., Sun, X., Mansur, M.: Improving Chinese word segmentation on micro-blog using rich punctuations. In: Proceedings of ACL: Short Papers (2013)
Google Scholar
Zhang, M., Zhang, Y., Che, W., Liu, T.: Character-level Chinese dependency parsing. In: Proceedings of ACL, pp. 1326–1336 (2014)
Google Scholar
Zhang, M., Zhang, Y., Che, W., Liu, T.: Type-supervised domain adaptation for joint segmentation and POS-tagging. In: Proceedings of COLING, pp. 588–597 (2014)
Google Scholar
Zhang, Y., Clark, S.: Joint word segmentation and POS tagging using a single perceptron. In: Proceedings of ACL 2008: HLT, pp. 888–896 (2008)
Google Scholar

Download references

Author information

Authors and Affiliations

School of Computer Science and Technology, Soochow University, Suzhou, 215006, Jiangsu, China
Jiayuan Chao, Zhenghua Li, Wenliang Chen & Min Zhang

Authors

Jiayuan Chao
View author publications
You can also search for this author in PubMed Google Scholar
Zhenghua Li
View author publications
You can also search for this author in PubMed Google Scholar
Wenliang Chen
View author publications
You can also search for this author in PubMed Google Scholar
Min Zhang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Zhenghua Li .

Editor information

Editors and Affiliations

Tsinghua University, Beijing, China
Juanzi Li
Rensselaer Polytechnic Institute, Troy, NY, USA
Heng Ji
Peking University, Beijing, China
Dongyan Zhao
Peking University, Beijing, China
Yansong Feng

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Chao, J., Li, Z., Chen, W., Zhang, M. (2015). Exploiting Heterogeneous Annotations for Weibo Word Segmentation and POS Tagging. In: Li, J., Ji, H., Zhao, D., Feng, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2015. Lecture Notes in Computer Science(), vol 9362. Springer, Cham. https://doi.org/10.1007/978-3-319-25207-0_46

Download citation

DOI: https://doi.org/10.1007/978-3-319-25207-0_46
Published: 20 October 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25206-3
Online ISBN: 978-3-319-25207-0
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics