Abstract
Bilingual word representations (BWRs) play a key role in many natural language processing (NLP) tasks, especially cross-lingual applications such as machine translation and cross-lingual information retrieval. Most existing methods learn BWRs offline in an unsupervised fashion, relying on the isomorphic assumption that word representations follow similar distributions across languages. Several authors have questioned this assumption, arguing that word representation spaces are non-isomorphic for many language pairs. In this paper, we adopt a novel unsupervised method to train BWRs jointly. We first use a dynamic programming algorithm to detect continuous bilingual segments. Then, we use the extracted bilingual data together with monolingual corpora to train BWRs jointly. Experiments show that our approach improves the quality of BWRs compared with several baselines on real-world datasets. (By unsupervised, we mean that no cross-lingual resources such as parallel text or bilingual lexicons are used directly.)
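The segment-detection step described above can be illustrated with a minimal sketch. This is not the paper's actual algorithm; it assumes a precomputed word-similarity matrix between a source sentence and a target sentence (e.g. cosine similarities from initial embeddings), and the function name `detect_bilingual_segments`, the `threshold`, and the diagonal-run formulation are all illustrative choices. A dynamic-programming pass extends runs of consecutive high-similarity word pairs and keeps runs of at least three words (the phrase length used in this paper) as candidate bilingual segments:

```python
import numpy as np

def detect_bilingual_segments(sim, min_len=3, threshold=0.5):
    """Illustrative DP pass over a similarity matrix `sim`
    (source positions x target positions): extend a diagonal run
    while word-pair similarity stays above `threshold`, and keep
    runs of at least `min_len` words as candidate segments.

    Returns a list of (src_start, tgt_start, length) tuples.
    """
    n, m = sim.shape
    # run[i, j] = length of the diagonal run of high-similarity
    # word pairs ending at positions (i, j)
    run = np.zeros((n, m), dtype=int)
    segments = []
    for i in range(n):
        for j in range(m):
            if sim[i, j] >= threshold:
                run[i, j] = 1 + (run[i - 1, j - 1] if i and j else 0)
                if run[i, j] >= min_len:
                    k = run[i, j]
                    segments.append((i - k + 1, j - k + 1, k))
    return segments
```

In this hypothetical setup, the extracted segment pairs would then be fed, together with the monolingual corpora, into a standard joint embedding-training step.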
Supported by Northwestern Polytechnical University and Zhejiang University.
Notes
- 1.
By etymologically close languages we mean closely related pairs such as English-French; distant languages are etymologically different pairs such as English-Chinese.
- 3.
Most related work extracts parallel sentences to improve machine translation systems, where the recall of the extracted parallel data also matters. We only aim to obtain some (rather than all) good-quality parallel data (words or phrases); full parallel sentences are not necessary.
- 4.
In this paper, we define a phrase as containing at least three words. We also test how the number of words affects the results in the experimental section.
Acknowledgments
This work is supported by the National Natural Science Foundation of China (61906158) and the Project of Science and Technology Research in Henan Province (212102210075).
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Zhu, S., Mi, C., Zhang, L. (2021). Inducing Bilingual Word Representations for Non-isomorphic Spaces by an Unsupervised Way. In: Qiu, H., Zhang, C., Fei, Z., Qiu, M., Kung, SY. (eds) Knowledge Science, Engineering and Management. KSEM 2021. Lecture Notes in Computer Science(), vol 12815. Springer, Cham. https://doi.org/10.1007/978-3-030-82136-4_37
Print ISBN: 978-3-030-82135-7
Online ISBN: 978-3-030-82136-4
eBook Packages: Computer Science (R0)