DOI: 10.1145/3299819.3299831

research-article

English-Chinese Cross Language Word Embedding Similarity Calculation

Published: 21 December 2018

ABSTRACT

Differences among the languages of various countries, regions, and nationalities create significant obstacles to communication. Cross-language word similarity (CLWS) calculation is one of the most practical ways to address this problem, and the choice of corpus is one of the factors that influences the result. This paper compares the similarity of word embeddings trained on bilingual parallel and non-parallel corpora under traditional models. First, it uses the fastText method to train monolingual word embeddings for Chinese and English and computes the semantic similarity between the two embedding spaces. It then maps the word embeddings into an implicit shared space using Multilingual Unsupervised and Supervised Embeddings (MUSE), and compares the effect of unsupervised and supervised learning on parallel and non-parallel corpora. Finally, the experimental results show that the MUSE model can better align the monolingual word embedding spaces, and that non-parallel corpora achieve the same effect as parallel corpora when calculating CLWS.
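The supervised alignment step the abstract describes (MUSE with a seed translation dictionary) reduces to the orthogonal Procrustes problem: find an orthogonal map W minimizing ||WX − Y||_F over seed pairs, solved in closed form via the SVD of YXᵀ. The sketch below illustrates this with made-up 4-dimensional vectors (real fastText embeddings are typically 300-dimensional and would be loaded from trained models); it is a minimal illustration of the technique, not the paper's actual code.

```python
import numpy as np

# Toy sketch of supervised cross-lingual alignment (Procrustes, as in MUSE):
# solve W* = argmin ||W X - Y||_F over orthogonal W, via SVD of Y X^T.
# The vectors here are synthetic stand-ins for fastText embeddings.

rng = np.random.default_rng(0)
d, n = 4, 6                                         # embedding dim, seed pairs
X = rng.normal(size=(d, n))                         # "source-language" vectors (columns)
R_true = np.linalg.qr(rng.normal(size=(d, d)))[0]   # hidden rotation between spaces
Y = R_true @ X                                      # "target-language" vectors

# Closed-form Procrustes solution: W = U V^T where U S V^T = SVD(Y X^T)
U, _, Vt = np.linalg.svd(Y @ X.T)
W = U @ Vt                                          # learned orthogonal map

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# After mapping, each source vector should align with its translation,
# so cross-language similarity can be read off as a cosine in one space.
sims = [cosine(W @ X[:, i], Y[:, i]) for i in range(n)]
```

Because W is constrained to be orthogonal, monolingual distances are preserved, which is why the aligned space can be used directly for CLWS via cosine similarity.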


Published in

AICCC '18: Proceedings of the 2018 Artificial Intelligence and Cloud Computing Conference
December 2018
206 pages
ISBN: 9781450366236
DOI: 10.1145/3299819

      Copyright © 2018 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Qualifiers

      • research-article
      • Research
      • Refereed limited