Abstract
In this paper, we propose a novel word representation for Chinese based on a state-of-the-art word embedding approach. Our main contribution is to integrate distributional representations of Chinese characters into the word embedding. Recent related work on European languages has demonstrated that information from inflectional morphology can reduce the problem of sparse data and improve word representations. Chinese has very little inflectional morphology, but there is potential for incorporating character-level information. Chinese characters are drawn from a fixed set (with just under four thousand in common usage), but a major problem with using characters is their ambiguity. To address this problem, we disambiguate the characters according to groupings in a semantic hierarchy. Coupling our character embeddings with word embeddings, we observe improved performance on the tasks of finding synonyms and rating word similarity compared to a model using word embeddings alone, especially for low-frequency words.
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Chen, X., Jin, P., McCarthy, D., Carroll, J. (2016). Integrating Character Representations into Chinese Word Embedding. In: Dong, M., Lin, J., Tang, X. (eds) Chinese Lexical Semantics. CLSW 2016. Lecture Notes in Computer Science, vol 10085. Springer, Cham. https://doi.org/10.1007/978-3-319-49508-8_32
DOI: https://doi.org/10.1007/978-3-319-49508-8_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49507-1
Online ISBN: 978-3-319-49508-8
eBook Packages: Computer Science, Computer Science (R0)