Exploiting Heterogeneous Annotations for Weibo Word Segmentation and POS Tagging

  • Conference paper
Natural Language Processing and Chinese Computing (NLPCC 2015)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9362)

Abstract

This paper describes our system designed for the NLPCC 2015 shared task on Chinese word segmentation (WS) and POS tagging for Weibo text. We treat WS and POS tagging as two separate tasks and use a cascaded approach. Our major focus is how to effectively exploit multiple heterogeneous datasets to boost the performance of statistical models. This work considers three heterogeneous datasets, i.e., Weibo (\(\textit{WB}\), 10K sentences), Penn Chinese Treebank 7.0 (\(\textit{CTB7}\), 50K), and People’s Daily (\(\textit{PD}\), 280K). For WS, we adopt the recently proposed coupled sequence labeling to combine \(\textit{WB}\), \(\textit{CTB7}\), and \(\textit{PD}\), boosting the F1 score from \(93.76\%\) (baseline model trained only on \(\textit{WB}\)) to \(95.58\%\) (\(+1.82\%\)). For POS tagging, we adopt an ensemble approach that combines coupled sequence labeling with the guide-feature-based method, since the three datasets follow three different annotation standards. First, we convert \(\textit{PD}\) into the annotation style of \(\textit{CTB7}\) based on coupled sequence labeling, denoted by \(\textit{PD}^{\textit{CTB}}\). Then, we merge \(\textit{CTB7}\) and \(\textit{PD}^{\textit{CTB}}\) to train a POS tagger, denoted by \(\textit{Tag}_{\textit{CTB7}+\textit{PD}^{\textit{CTB}}}\), which is further used to produce guide features on \(\textit{WB}\). Finally, the tagging F1 score is improved from \(87.93\%\) to \(88.99\%\) (\(+1.06\%\)).
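The guide-feature step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that a source tagger (standing in for \(\textit{Tag}_{\textit{CTB7}+\textit{PD}^{\textit{CTB}}}\)) is available as a plain function from a word list to a tag list, and that its predictions are turned into extra feature strings for a tagger trained on \(\textit{WB}\). The function names, the feature templates, and the toy lexicon below are all hypothetical and are not taken from the paper.

```python
from typing import Callable, List

# A tagger is modeled as a function from a word sequence to one tag per word.
Tagger = Callable[[List[str]], List[str]]


def guide_features(words: List[str], guide_tags: List[str], i: int) -> List[str]:
    """Feature strings for position i, built from the source tagger's predictions.

    The templates are illustrative assumptions, not the paper's feature set.
    """
    prev_tag = guide_tags[i - 1] if i > 0 else "<s>"
    next_tag = guide_tags[i + 1] if i + 1 < len(guide_tags) else "</s>"
    return [
        f"g0={guide_tags[i]}",                 # guide tag of the current word
        f"g-1,g0={prev_tag}|{guide_tags[i]}",  # guide-tag bigram to the left
        f"g0,g+1={guide_tags[i]}|{next_tag}",  # guide-tag bigram to the right
        f"w0,g0={words[i]}|{guide_tags[i]}",   # word paired with its guide tag
    ]


def add_guide_features(
    sentences: List[List[str]], source_tagger: Tagger
) -> List[List[List[str]]]:
    """Run the source tagger on each sentence and attach guide features per word."""
    out = []
    for words in sentences:
        guide_tags = source_tagger(words)
        if len(guide_tags) != len(words):
            raise ValueError("source tagger must return one tag per word")
        out.append([guide_features(words, guide_tags, i) for i in range(len(words))])
    return out


def toy_source_tagger(words: List[str]) -> List[str]:
    """Hypothetical stand-in for the source tagger: a tiny lookup with an NN default."""
    lexicon = {"我": "PN", "爱": "VV", "微博": "NN"}
    return [lexicon.get(w, "NN") for w in words]


features = add_guide_features([["我", "爱", "微博"]], toy_source_tagger)
print(features[0][1])
```

In a setup like this, the resulting strings would be appended to the target tagger's ordinary word and context features, so the target model can learn how source-style tags map onto the \(\textit{WB}\) tag set.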

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61432013 and 61273319) and the Jiangsu Planned Projects for Postdoctoral Research Funds (No. 1401075B).

Author information

Corresponding author

Correspondence to Zhenghua Li.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Chao, J., Li, Z., Chen, W., Zhang, M. (2015). Exploiting Heterogeneous Annotations for Weibo Word Segmentation and POS Tagging. In: Li, J., Ji, H., Zhao, D., Feng, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2015. Lecture Notes in Computer Science, vol 9362. Springer, Cham. https://doi.org/10.1007/978-3-319-25207-0_46

  • DOI: https://doi.org/10.1007/978-3-319-25207-0_46

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-25206-3

  • Online ISBN: 978-3-319-25207-0

  • eBook Packages: Computer Science, Computer Science (R0)
