Abstract
Mapping-based methods for inducing a cross-lingual embedding space learn a linear mapping from each monolingual embedding space to a shared semantic space, with English typically chosen as the hub language. Such methods rest on the assumption that the mapping is orthogonal. Resource limitations and typological distance from English often cause low-resource languages to deviate from this assumption and, consequently, to map poorly. In this paper, we present a method for identifying optimal bridge languages that yield better mappings for low-resource languages in the cross-lingual embedding space. We also report Bilingual Lexicon Induction (BLI) performance for shared semantic spaces induced using different cross-lingual signals.
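For concreteness, a minimal sketch of the orthogonal mapping step the abstract refers to: the standard closed-form Procrustes solution recovers the orthogonal map between two embedding spaces from a seed dictionary of translation pairs. The matrices below are random toy stand-ins, not the paper's data, and this illustrates only the baseline mapping, not the bridge-language selection method itself.

```python
import numpy as np

def orthogonal_procrustes(X, Y):
    """Solve min_W ||XW - Y||_F subject to W being orthogonal.

    X, Y: (n, d) arrays of source- and target-language embeddings
    for n translation pairs from a seed dictionary. The solution is
    W = U V^T, where U S V^T is the SVD of X^T Y.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy usage: random vectors standing in for real monolingual embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 300))                 # source-language vectors
W_true = np.linalg.qr(rng.normal(size=(300, 300)))[0]
Y = X @ W_true                                    # ideal case: spaces truly orthogonal
W = orthogonal_procrustes(X, Y)
print(np.allclose(X @ W, Y, atol=1e-8))           # True when the assumption holds exactly
```

When the two spaces are not related by an orthogonal transform, as the abstract argues happens for low-resource languages far from English, the residual ||XW - Y|| of this fit stays large, which is what motivates routing such languages through better-matched bridge languages.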