Chinese Word Embedding Learning with Limited Data

Chen, Shurui; Chen, Yufu; Lu, Yuyin; Rao, Yanghui; Xie, Haoran; Li, Qing

doi:10.1007/978-3-030-85896-4_18


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12858)

Included in the following conference series:

Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data

Abstract

With the increasing demand for high-quality Chinese word embeddings in natural language processing, Chinese word embedding learning has attracted wide attention in recent years. Most existing research has focused on capturing word semantics from large-scale datasets. However, these methods struggle to produce effective word embeddings when only limited data is available, as is common in specific fields. Observing the rich semantic information carried by Chinese fine-grained structures, we develop a model that fully fuses these structures as auxiliary information for word embedding learning. The proposed model views word context information as a combination of word, character, pronunciation, and component. In addition, it adds the semantic relationship between pronunciations and components as a constraint to exploit the auxiliary information comprehensively. Based on the decomposition of the shifted positive pointwise mutual information matrix, our model can effectively generate Chinese word embeddings from small-scale data. The results of word analogy, word similarity, and named entity recognition experiments conducted on two public datasets show the effectiveness of the proposed model in capturing Chinese word semantics with limited data.
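
As a rough illustration of the matrix the abstract refers to, the sketch below builds a word-word co-occurrence matrix, converts it to shifted positive pointwise mutual information using the standard definition SPPMI(w, c) = max(PMI(w, c) - log k, 0), and factorizes it. This is not the authors' model: it uses word contexts only, omits the character, pronunciation, and component information and the pronunciation-component constraint, and swaps in a plain truncated SVD as a generic decomposition. The function names (`cooccurrence`, `sppmi`, `embed`), the parameters (`window`, `shift_k`, `dim`), and the toy sentences are all hypothetical.

```python
import numpy as np


def cooccurrence(sentences, window=2):
    """Count symmetric word-word co-occurrences within a fixed window."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            lo, hi = max(0, i - window), min(len(s), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[index[w], index[s[j]]] += 1.0
    return vocab, counts


def sppmi(counts, shift_k=1):
    """Shifted positive PMI: max(PMI - log(shift_k), 0), element-wise."""
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (row * col))
    # Pairs that never co-occur get -inf so they clip to zero below.
    pmi = np.where(counts > 0, pmi, -np.inf)
    return np.maximum(pmi - np.log(shift_k), 0.0)


def embed(matrix, dim):
    """Assumed stand-in factorization: truncated SVD, scaled by sqrt(S)."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :dim] * np.sqrt(s[:dim])


if __name__ == "__main__":
    # Toy pre-segmented corpus (hypothetical data, for illustration only).
    corpus = [["我", "喜欢", "猫"], ["我", "喜欢", "狗"]]
    vocab, counts = cooccurrence(corpus, window=2)
    vectors = embed(sppmi(counts, shift_k=1), dim=2)
    for word, vec in zip(vocab, vectors):
        print(word, vec)
```

With so little data most co-occurrence cells are zero, which is the sparsity problem that motivates adding finer-grained context types in the first place.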

Acknowledgment

We are grateful to the reviewers for their valuable comments. This work was supported by the National Natural Science Foundation of China (61972426) and the Guangdong Basic and Applied Basic Research Foundation (2020A1515010536).

Author information

Authors and Affiliations

School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
Shurui Chen, Yufu Chen, Yuyin Lu & Yanghui Rao
Department of Computing and Decision Sciences, Lingnan University, Tuen Mun, New Territories, Hong Kong
Haoran Xie
Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong
Qing Li

Corresponding author

Correspondence to Yanghui Rao.

Editor information

Editors and Affiliations

University of Macau, Macau, China
Leong Hou U
University of Caen Normandie, Caen, France
Marc Spaniol
Osaka University, Osaka, Japan
Yasushi Sakurai
South China University of Technology, Guangzhou, China
Junying Chen

About this paper

Cite this paper

Chen, S., Chen, Y., Lu, Y., Rao, Y., Xie, H., Li, Q. (2021). Chinese Word Embedding Learning with Limited Data. In: U, L.H., Spaniol, M., Sakurai, Y., Chen, J. (eds) Web and Big Data. APWeb-WAIM 2021. Lecture Notes in Computer Science, vol 12858. Springer, Cham. https://doi.org/10.1007/978-3-030-85896-4_18

DOI: https://doi.org/10.1007/978-3-030-85896-4_18
Published: 19 August 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-85895-7
Online ISBN: 978-3-030-85896-4
eBook Packages: Computer Science, Computer Science (R0)
