
A Feature-Rich CRF Segmenter for Chinese Micro-Blog

  • Conference paper
Natural Language Understanding and Intelligent Applications (ICCPOL 2016, NLPCC 2016)

Abstract

This paper describes our system for Chinese word segmentation of micro-blog text, one of the NLPCC-ICCPOL 2016 Shared Tasks [1]. A CRF (Conditional Random Field) model is employed to treat word segmentation as a sequence labeling problem, and seven sets of features are selected to train it. Under the weighted measures [2], the system achieves an \(f_{b}\) of 0.798144 on the closed track, 0.81968 on the semi-open track, and 0.82217 on the open track.
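To make the sequence-labeling formulation concrete, the sketch below converts a segmented sentence into one tag per character and back again. It assumes the common four-tag B/M/E/S scheme (begin, middle, end, single-character word); the abstract does not state which tag set the system uses, and the function names `words_to_tags` and `tags_to_words` are hypothetical helpers introduced only for illustration. A CRF would be trained to predict such tags from character-level features, which this sketch does not attempt.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Map a segmented sentence to one tag per character.

    Assumed scheme (not confirmed by the abstract):
    B = first character of a multi-character word, M = middle,
    E = last character, S = single-character word.
    """
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild words from characters and their predicted tags."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        # E and S both close the current word.
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        # Keep trailing characters if the tag sequence ends mid-word.
        words.append(current)
    return words


segmented = ["我", "喜欢", "微博客"]
tags = words_to_tags(segmented)
restored = tags_to_words("".join(segmented), tags)
```

Under this assumed scheme, segmenting a sentence reduces to predicting one label per character, so any sequence labeler that outputs such tags can be decoded into words with `tags_to_words`.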


Notes

  1. http://crfpp.googlecode.com/svn/trunk/doc/index.html

  2. https://github.com/sunflowerlyb/idiom

References

  1. Qiu, X., Qian, P., Shi, Z., Wu, S.: Overview of the NLPCC 2016 Shared Task: Chinese Word Segmentation for Micro-Blog Texts


  2. Qian, P., Qiu, X., Huang, X.: A new psychometric-inspired evaluation metric for Chinese word segmentation. In: Meeting of the Association for Computational Linguistics (2016)

  3. Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34(1), 1–47 (2002)


  4. Jin, K.L., Ng, H.T., Guo, W.: A maximum entropy approach to Chinese word segmentation. In: Proceedings of 4th SIGHAN Workshop on Chinese Language Processing (2005)


  5. Peng, F., Feng, F., McCallum, A.: Chinese segmentation and new word detection using conditional random fields. In: Proceedings of COLING, pp. 562–568 (2004)

  6. Chen, X., Qiu, X., Zhu, C., et al.: Long short-term memory neural networks for Chinese word segmentation. In: Conference on Empirical Methods in Natural Language Processing (2015)


  7. Zhao, H., Li, M., Lu, B.L., et al.: Effective tag set selection in Chinese word segmentation via conditional random field modeling. In: 20th Pacific Asia Conference on Language, Information and Computation, pp. 87–94 (2006)

  8. Yan, J.: Research and application of Chinese word segmentation based on conditional random fields (2009). (in Chinese)


  9. Gao, Q., Vogel, S.: A multi-layer Chinese word segmentation system optimized for out-of-domain tasks (2010)

  10. Wu, G., et al.: Leveraging rich linguistic features for cross-domain Chinese segmentation. In: CIPS-SIGHAN Joint Conference on Chinese Language Processing (2014)


  11. Emerson, T.: The second international Chinese word segmentation bakeoff. In: Proceedings of 4th SIGHAN Workshop on Chinese Language Processing, p. 133 (2005)


  12. Goutte, C., Gaussier, E.: A probabilistic interpretation of precision, recall and F-score, with implication for evaluation. In: Losada, D.E., Fernández-Luna, J.M. (eds.) ECIR 2005. LNCS, vol. 3408, pp. 345–359. Springer, Heidelberg (2005). doi:10.1007/978-3-540-31865-1_25


Acknowledgments

This work was partially supported by the Natural Science Foundation of China (No. 61273365), the discipline building plan in 111 base (No. B08004), the Engineering Research Center of Information Networks of MOE, and the Co-construction Program with the Beijing Municipal Commission of Education.

Author information


Corresponding author

Correspondence to Yabin Leng.


Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Leng, Y., Liu, W., Wang, S., Wang, X. (2016). A Feature-Rich CRF Segmenter for Chinese Micro-Blog. In: Lin, CY., Xue, N., Zhao, D., Huang, X., Feng, Y. (eds) Natural Language Understanding and Intelligent Applications. ICCPOL 2016, NLPCC 2016. Lecture Notes in Computer Science, vol 10102. Springer, Cham. https://doi.org/10.1007/978-3-319-50496-4_78


  • DOI: https://doi.org/10.1007/978-3-319-50496-4_78


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-50495-7

  • Online ISBN: 978-3-319-50496-4

  • eBook Packages: Computer Science, Computer Science (R0)
