Chinese Word Segmentation with Character Abstraction

Tian, Le; Qiu, Xipeng; Huang, Xuanjing

doi:10.1007/978-3-642-41491-6_4


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8202)


Abstract

Chinese word segmentation is an important and necessary step in analyzing Chinese text. In this paper, we focus on one of the primary challenges in Chinese word segmentation: low accuracy on out-of-vocabulary (OOV) words. To address this problem, we group “similar” characters together to generate a more abstract representation. Experimental results show that character abstraction yields a significant relative error reduction of 24.83% on average over the state-of-the-art baseline.
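The abstract does not say how characters are grouped or how the groups enter the model, so the following is only a minimal sketch of the general idea under stated assumptions. The table `CHAR_CLASS`, its class names, the feature templates, and the helper `relative_error_reduction` are all hypothetical illustrations, not the authors' method. The sketch shows how features built from abstract character classes can be shared between characters that never co-occur in training data, which is one plausible way such grouping could help with out-of-vocabulary words.

```python
from typing import Dict, List

# Hypothetical character-to-class table. In practice the groups would be
# derived from data; the abstract does not specify how.
CHAR_CLASS: Dict[str, str] = {
    "张": "SURNAME", "王": "SURNAME", "李": "SURNAME",
    "一": "NUM", "二": "NUM", "三": "NUM",
}


def abstract_char(ch: str) -> str:
    """Return the abstract class of ch, or ch itself if it is ungrouped."""
    return CHAR_CLASS.get(ch, ch)


def features(sentence: str, i: int) -> List[str]:
    """Build features for position i from raw characters (C*) and
    from their abstract classes (A*), using a one-character window."""
    padded = "^" + sentence + "$"  # sentence boundary markers
    j = i + 1                      # index of position i inside `padded`
    prev, cur, nxt = padded[j - 1], padded[j], padded[j + 1]
    a_prev, a_cur, a_nxt = (abstract_char(c) for c in (prev, cur, nxt))
    return [
        f"C0={cur}",
        f"C-1C0={prev}{cur}",
        f"C0C1={cur}{nxt}",
        f"A0={a_cur}",
        f"A-1A0={a_prev}|{a_cur}",
        f"A0A1={a_cur}|{a_nxt}",
    ]


def relative_error_reduction(baseline_score: float, new_score: float) -> float:
    """Fraction of the baseline's error (1 - score) removed by the new system."""
    baseline_error = 1.0 - baseline_score
    new_error = 1.0 - new_score
    return (baseline_error - new_error) / baseline_error
```

With this toy table, `features("张三说", 0)` and `features("王二说", 0)` differ in their raw-character features but produce identical abstract-class features, so a tagger trained on one name could transfer what it learned to the other. The helper illustrates how a figure such as the reported 24.83% is conventionally computed from a baseline score and an improved score; the scores behind that figure are not given in the abstract.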



Author information

Authors and Affiliations

School of Computer Science, Fudan University, China
Le Tian, Xipeng Qiu & Xuanjing Huang

Editor information

Editors and Affiliations

Department of Computer Science and Technology, Tsinghua University, 100084, Beijing, China
Maosong Sun
Horizon Doctoral Training Centre, School of Computer Science, University of Nottingham, NG8 1BB, Nottingham, UK
Min Zhang
Google Inc., Mountain View, CA, USA
Dekang Lin
Baidu Inc., Beijing, China
Haifeng Wang

Cite this paper

Tian, L., Qiu, X., Huang, X. (2013). Chinese Word Segmentation with Character Abstraction. In: Sun, M., Zhang, M., Lin, D., Wang, H. (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. NLP-NABD 2013, CCL 2013. Lecture Notes in Computer Science (LNAI), vol 8202. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41491-6_4

DOI: https://doi.org/10.1007/978-3-642-41491-6_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41490-9
Online ISBN: 978-3-642-41491-6
eBook Packages: Computer Science (R0)
