ABSTRACT
Differences among the languages of various countries, regions, and nationalities create major obstacles to communication. Cross-language word similarity (CLWS) calculation is the most practical way to address this problem, and the choice of corpus is one of the factors that influence the result. This paper compares the similarity of word embeddings trained on bilingual parallel and non-parallel corpora under traditional models. First, it uses the fastText method to compute monolingual word embeddings for Chinese and English and measures the semantic similarity between the two embedding spaces. It then maps the word embeddings into an implicit shared space using Multilingual Unsupervised and Supervised Embeddings (MUSE) and compares unsupervised and supervised machine-learning methods on parallel and non-parallel corpora. Finally, the experimental results show that the MUSE model can better align the monolingual word embedding spaces, and that non-parallel corpora are as effective as parallel corpora for calculating CLWS.
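The core CLWS computation described above reduces to a cosine similarity between two word vectors that live in the same (MUSE-aligned) space. The following is a minimal sketch of that step; the three-dimensional vectors are toy data standing in for the 300-dimensional fastText embeddings the paper uses, and the word pairs are illustrative assumptions, not values from the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy aligned embeddings. In practice these would be fastText vectors
# mapped into a shared space by MUSE before comparison.
zh_emb = {"猫": [0.9, 0.1, 0.3]}
en_emb = {"cat": [0.8, 0.2, 0.25], "car": [0.1, 0.9, 0.0]}

sim_cat = cosine_similarity(zh_emb["猫"], en_emb["cat"])
sim_car = cosine_similarity(zh_emb["猫"], en_emb["car"])
# After a good alignment, the translation pair scores higher.
assert sim_cat > sim_car
```

In the paper's setting, the quality of the alignment step (supervised with a seed dictionary, or unsupervised via adversarial training) determines how reliable this similarity score is as a CLWS measure.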
- Dekang Lin. 1998. An Information-Theoretic Definition of Similarity. International Conference on Machine Learning, 1(2):296--304.
- Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, Hervé Jégou. 2018. Word Translation Without Parallel Data. Proceedings of ICLR 2018.
- Che Wanxiang, Liu Ting, Qin Bing. 2004. Similar Chinese Sentence Retrieval Based on Improved Edit-distance. High Technology Letters, 14(7).
- Yu Zhengtao, Fan Xiaozhong, Guo Jianyi, Geng Zengmin. 2006. Answer Extracting for Chinese Question Answering System Based on Latent Semantic Analysis. Chinese Journal of Computers, 29(10).
- David M. Blei, Andrew Y. Ng, Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993--1022.
- Yang Sichun. 2006. An Improved Model for Sentence Similarity Computing. Journal of University of Electronic Science and Technology of China, 35(6).
- Tian Jiule, Zhao Wei. 2010. Words Similarity Algorithm Based on Tongyici Cilin in Semantic Web Adaptive Learning System. Journal of Jilin University (Information Science Edition), 2010(6).
- Cheng Tao, Shi Shuicai, Wang Xia, Lu Xueqiang. 2007. Thematic Words Extracting from Chinese Text Based on Tongyici Cilin. Journal of Guangxi Normal University (Natural Science Edition), 25(2).
- Gao Sidan. 2008. Automatic test paper correction for subjective questions based on natural language understanding. Thesis, Nanjing University.
- Cai Dongfeng, Bai Yu, Yu Shui, Ye Na, Ren Xiaona. 2010. A Context Based Word Similarity Computing Method. Journal of Chinese Information Processing, 2010(3).
- Yuan Xiaofeng. 2011. Research of Word Similarity Calculation Based on HowNet. Journal of Chengdu University (Natural Science Edition), 2011(4).
- Wang Xiaolin, Wang Yi. 2011. Improved word similarity algorithm based on HowNet. Journal of Computer Applications, 2011(11).
- Lei Li, Zhiqing Wang. 2012. Chinese Word Similarity Computing. Proceedings of the 2012 3rd IEEE International Conference on Network Infrastructure and Digital Content.
- Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013c. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, pp. 3111--3119.
- Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. Proceedings of Workshop at ICLR.
- Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. Proceedings of EMNLP, 14:1532--1543.
- Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135--146.
- Zellig S. Harris. 1954. Distributional structure. Word, 10(2-3):146--162.
- Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of Tricks for Efficient Text Classification. arXiv preprint arXiv:1607.01759.
- Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013b. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.
- Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation. Proceedings of EACL.
- Angeliki Lazaridou, Georgiana Dinu, and Marco Baroni. 2015. Hubness and pollution: Delving into cross-space mapping for zero-shot learning. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics.
- Samuel L. Smith, David H. P. Turban, Steven Hamblin, and Nils Y. Hammerla. 2017. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. International Conference on Learning Representations.
- Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 451--462.
- Hailong Cao, Tiejun Zhao, Shu Zhang, and Yao Meng. 2016. A distribution-based model to learn bilingual word embeddings. Proceedings of COLING.
- Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. 2017b. Adversarial training for unsupervised bilingual lexicon induction. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.
Index Terms
- English-Chinese Cross Language Word Embedding Similarity Calculation