Integrating Character Representations into Chinese Word Embedding

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10085)

Abstract

In this paper we propose a novel word representation for Chinese based on a state-of-the-art word embedding approach. Our main contribution is to integrate distributional representations of Chinese characters into the word embedding. Recent related work on European languages has demonstrated that information from inflectional morphology can reduce the problem of sparse data and improve word representations. Chinese has very little inflectional morphology, but there is potential for incorporating character-level information. Chinese characters are drawn from a fixed set – with just under four thousand in common usage – but a major problem with using characters is their ambiguity. To address this problem, we disambiguate the characters according to groupings in a semantic hierarchy. Coupling our character embeddings with word embeddings, we observe improved performance on the tasks of finding synonyms and rating word similarity compared to a model using word embeddings alone, especially for low-frequency words.
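
The idea of coupling disambiguated character embeddings with a word embedding can be illustrated with a minimal sketch. The abstract does not state how the two are combined, so the averaging rule below is an assumption, as are the toy vectors, the semantic group labels (`G_road`, `G_speech`) and the function names `compose` and `cosine`. Each character is keyed by a (character, group) pair so that an ambiguous character has one vector per semantic group.

```python
import math

# Hypothetical toy vectors; real embeddings would be learned from a corpus.
word_vectors = {"道路": [0.2, 0.8, 0.0]}

# Assumption: one vector per (character, semantic group) pair, so an
# ambiguous character such as 道 gets a separate vector for each group.
char_vectors = {
    ("道", "G_road"): [0.0, 1.0, 0.0],
    ("道", "G_speech"): [1.0, 0.0, 0.0],
    ("路", "G_road"): [0.2, 0.8, 0.2],
}

def compose(word, char_keys):
    """Average the word vector with the mean of its character vectors.

    This combination rule is an illustrative assumption, not necessarily
    the one used in the paper.
    """
    w = word_vectors[word]
    chars = [char_vectors[k] for k in char_keys]
    char_mean = [sum(col) / len(chars) for col in zip(*chars)]
    return [(a + b) / 2 for a, b in zip(w, char_mean)]

def cosine(u, v):
    """Cosine similarity, a common score for word similarity tasks."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Combine the word with the character senses selected for this word.
combined = compose("道路", [("道", "G_road"), ("路", "G_road")])
```

Under this sketch, a low-frequency word with a poorly estimated word vector still inherits information from its characters, which are shared across many words and therefore seen far more often in training data.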

Author information

Corresponding author

Correspondence to Peng Jin.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Chen, X., Jin, P., McCarthy, D., Carroll, J. (2016). Integrating Character Representations into Chinese Word Embedding. In: Dong, M., Lin, J., Tang, X. (eds) Chinese Lexical Semantics. CLSW 2016. Lecture Notes in Computer Science, vol 10085. Springer, Cham. https://doi.org/10.1007/978-3-319-49508-8_32

  • DOI: https://doi.org/10.1007/978-3-319-49508-8_32

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-49507-1

  • Online ISBN: 978-3-319-49508-8

  • eBook Packages: Computer Science, Computer Science (R0)
