Abstract
Chinese word segmentation is an important and necessary problem to analyze Chinese texts. In this paper, we focus on the primary challenges in Chinese word segmentation: low accuracy of out-of-vocabulary word. To resolve this difficult problems, we group the “similar” characters to generate more abstract representation. Experimental results show that character abstraction yields a significant relative error reduction of 24.83% in average over the state-of-the-art baseline.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Andrew, G.: A hybrid markov/semi-markov conditional random field for sequence segmentation. In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pp. 465–472. Association for Computational Linguistics (2006)
Brown, P., Desouza, P., Mercer, R., Pietra, V., Lai, J.: Class-based n-gram models of natural language. Computational Linguistics 18(4), 467–479 (1992)
Collins, M.: Discriminative training methods for hidden markov models: Theory and experiments with perceptron algorithms. In: Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (2002)
Collobert, R., Weston, J.: A unified architecture for natural language processing: Deep neural networks with multitask learning. In: Proceedings of the 25th International Conference on Machine Learning, pp. 160–167. ACM (2008)
Crammer, K., Singer, Y.: Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research 3, 951–991 (2003)
Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., Singer, Y.: Online passive-aggressive algorithms. Journal of Machine Learning Research 7, 551–585 (2006)
Dong, Z., Dong, Q.: Hownet and the Computation of Meaning. World Scientific Publishing Co., Inc., River Edge (2006)
Duda, R., Hart, P., Stork, D.: Pattern Classification, 2nd edn. Wiley, New York (2001)
Li, W., McCallum, A.: Semi-supervised sequence modeling with syntactic topic models. In: Proceedings of the National Conference on Artificial Intelligence, p. 813 (2005)
Liang, P.: Semi-supervised learning for natural language. Ph.D. thesis, Massachusetts Institute of Technology (2005)
Mnih, A., Hinton, G.: A scalable hierarchical distributed language model. In: Advances in Neural Information Processing Systems 21, pp. 1081–1088 (2009)
Peng, F., Feng, F., McCallum, A.: Chinese segmentation and new word detection using conditional random fields. In: Proceedings of the 20th International Conference on Computational Linguistics (2004)
Qiu, X., Zhang, Q., Huang, X.: FudanNLP: A toolkit for Chinese natural language processing. In: Proceedings of ACL (2013)
Sarawagi, S., Cohen, W.: Semi-markov conditional random fields for information extraction. In: Advances in Neural Information Processing Systems 17, pp. 1185–1192 (2005)
Turian, J., Ratinov, L., Bengio, Y.: Word representations: a simple and general method for semi-supervised learning. Urbana 51, 61801 (2010)
Xue, N.: Chinese word segmentation as character tagging. Computational Linguistics and Chinese Language Processing 8(1), 29–48 (2003)
Zhao, H., Huang, C., Li, M., Lu, B.: A unified character-based tagging framework for Chinese word segmentation. ACM Transactions on Asian Language Information Processing (TALIP) 9(2), 5 (2010)
Zhao, H., Liu, Q.: The cips-sighan clp 2010 Chinese word segmentation bakeoff. In: Proceedings of the First CPS-SIGHAN Joint Conference on Chinese Language Processing (2010)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Tian, L., Qiu, X., Huang, X. (2013). Chinese Word Segmentation with Character Abstraction. In: Sun, M., Zhang, M., Lin, D., Wang, H. (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. NLP-NABD CCL 2013 2013. Lecture Notes in Computer Science(), vol 8202. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41491-6_4
Download citation
DOI: https://doi.org/10.1007/978-3-642-41491-6_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41490-9
Online ISBN: 978-3-642-41491-6
eBook Packages: Computer ScienceComputer Science (R0)