ABSTRACT
Differences among the languages of various countries, regions, and nationalities create major obstacles to communication. Cross-language word similarity (CLWS) calculation is the most practical way to address this problem, and the choice of corpus is one of the factors that influence the result. This paper compares the similarity of word embeddings trained on bilingual parallel and non-parallel corpora under traditional models. First, it uses the fastText method to compute monolingual word embeddings for Chinese and English and measures the semantic similarity between the two embedding spaces. It then maps the word embeddings into an implicit shared space using Multilingual Unsupervised and Supervised Embeddings (MUSE) and compares unsupervised and supervised machine-learning methods on parallel and non-parallel corpora. Finally, the experimental results show that the MUSE model can better align the monolingual word embedding spaces, and that non-parallel corpora are as effective as parallel corpora for calculating CLWS.
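The core CLWS computation described above reduces to a cosine similarity between two word vectors that live in the same (MUSE-aligned) space. The following is a minimal sketch of that step; the three-dimensional vectors are toy data standing in for the 300-dimensional fastText embeddings the paper uses, and the word pairs are illustrative assumptions, not values from the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy aligned embeddings. In practice these would be fastText vectors
# mapped into a shared space by MUSE before comparison.
zh_emb = {"猫": [0.9, 0.1, 0.3]}
en_emb = {"cat": [0.8, 0.2, 0.25], "car": [0.1, 0.9, 0.0]}

sim_cat = cosine_similarity(zh_emb["猫"], en_emb["cat"])
sim_car = cosine_similarity(zh_emb["猫"], en_emb["car"])
# After a good alignment, the translation pair scores higher.
assert sim_cat > sim_car
```

In the paper's setting, the quality of the alignment step (supervised with a seed dictionary, or unsupervised via adversarial training) determines how reliable this similarity score is as a CLWS measure.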
- Dekang Lin. 1998. An Information-Theoretic Definition of Similarity. International Conference on Machine Learning, 1(2):296--304.
- Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, Hervé Jégou. 2018. Word Translation Without Parallel Data. Proceedings of ICLR 2018.
- Che Wanxiang, Liu Ting, Qin Bing. 2004. Similar Chinese Sentence Retrieval Based on Improved Edit-distance. High Technology Letters, 14(7).
- Yu Zhengtao, Fan Xiaozhong, Guo Jianyi, Geng Zengmin. 2006. Answer Extracting for Chinese Question Answering System Based on Latent Semantic Analysis. Chinese Journal of Computers, 29(10).
- David M. Blei, Andrew Y. Ng, Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993--1022.
- Yang Sichun. 2006. An Improved Model for Sentence Similarity Computing. Journal of University of Electronic Science and Technology of China, 35(6).
- Tian Jiule, Zhao Wei. 2010. Words Similarity Algorithm Based on Tongyici Cilin in Semantic Web Adaptive Learning System. Journal of Jilin University (Information Science Edition), 2010(6).
- Cheng Tao, Shi Shuicai, Wang Xia, Lu Xueqiang. 2007. Thematic Words Extracting from Chinese Text Based on Tongyici Cilin. Journal of Guangxi Normal University (Natural Science Edition), 25(2).
- Gao Sidan. 2008. Automatic test paper correction for subjective questions based on natural language understanding. Thesis, Nanjing University.
- Cai Dongfeng, Bai Yu, Yu Shui, Ye Na, Ren Xiaona. 2010. A Context Based Word Similarity Computing Method. Journal of Chinese Information Processing, 2010(3).
- Yuan Xiaofeng. 2011. Research of Word Similarity Calculation Based on HowNet. Journal of Chengdu University (Natural Science Edition), 2011(4).
- Wang Xiaolin, Wang Yi. 2011. Improved word similarity algorithm based on HowNet. Journal of Computer Applications, 2011(11).
- Lei Li, Zhiqing Wang. 2012. Chinese Word Similarity Computing. Proceedings of the 2012 3rd IEEE International Conference on Network Infrastructure and Digital Content.
- Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013c. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, pp. 3111--3119.
- Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. Proceedings of Workshop at ICLR.
- Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. Proceedings of EMNLP, 14:1532--1543.
- Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135--146.
- Zellig S. Harris. 1954. Distributional structure. Word, 10(2-3):146--162.
- Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of Tricks for Efficient Text Classification. arXiv preprint arXiv:1607.01759.
- Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013b. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.
- Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation. Proceedings of EACL.
- Angeliki Lazaridou, Georgiana Dinu, and Marco Baroni. 2015. Hubness and pollution: Delving into cross-space mapping for zero-shot learning. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics.
- Samuel L. Smith, David H. P. Turban, Steven Hamblin, and Nils Y. Hammerla. 2017. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. International Conference on Learning Representations.
- Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 451--462.
- Hailong Cao, Tiejun Zhao, Shu Zhang, and Yao Meng. 2016. A distribution-based model to learn bilingual word embeddings. Proceedings of COLING.
- Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. 2017b. Adversarial training for unsupervised bilingual lexicon induction. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.
Index Terms
- English-Chinese Cross Language Word Embedding Similarity Calculation