
CD-BLI: Confidence-Based Dual Refinement for Unsupervised Bilingual Lexicon Induction

  • Conference paper
  • In: Natural Language Processing and Chinese Computing (NLPCC 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14303)

Abstract

Unsupervised bilingual lexicon induction is a crucial and challenging task in multilingual NLP that aims to induce word translations by aligning monolingual word embeddings. Existing works treat all word pairs equally and ignore the credibility of individual pairs; as a result, they are limited to global operations on static word embeddings and fail when applied to pre-trained language models. To address this problem, we propose confidence-based dual refinement for unsupervised bilingual lexicon induction, where embeddings are refined from two aspects (static word embeddings and pre-trained models) based on the confidence of word pairs, i.e., the credibility that a word pair is correctly aligned. For static word embeddings, instead of a global operation, we compute personalized mappings for different words based on confidence. For pre-trained language models, we fine-tune the model with positive and negative samples generated according to confidence. Finally, we combine the outputs of both aspects as the final result. Extensive experiments on public datasets, covering both rich-resource and low-resource languages, demonstrate the superiority of our proposal.
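The abstract only summarizes the method, so the following minimal sketch (Python with NumPy) illustrates the general idea of scoring candidate word pairs by confidence and letting that confidence steer the refinement of static embeddings. It is not the paper's implementation: the CSLS-style margin used as a confidence score and the single confidence-weighted orthogonal mapping are generic stand-ins for the personalized per-word mappings described above, and all function names are hypothetical.

import numpy as np

def csls_confidence(X, Y, pairs, k=10):
    # X, Y: row-normalised source/target embedding matrices (n_src x d, n_tgt x d).
    # pairs: list of (i, j) candidate translation pairs.
    # Confidence = CSLS-style margin: pair similarity minus the average
    # similarity of each word to its k nearest neighbours on the other side.
    sims = X @ Y.T
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1)   # hubness penalty, source side
    r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0)   # hubness penalty, target side
    return np.array([2 * sims[i, j] - r_src[i] - r_tgt[j] for i, j in pairs])

def confidence_weighted_map(X, Y, pairs, conf):
    # Orthogonal map W minimising sum_i conf_i * ||W x_i - y_i||^2 (closed form via SVD).
    # High-confidence pairs pull the alignment more strongly; apply as X @ W.T.
    w = np.clip(conf, 0.0, None)
    Xs = np.stack([X[i] for i, _ in pairs]) * w[:, None]
    Ys = np.stack([Y[j] for _, j in pairs])
    U, _, Vt = np.linalg.svd(Ys.T @ Xs)
    return U @ Vt

In the same spirit, the confidence scores could be used to select positive and negative pairs for fine-tuning a pre-trained language model, the second refinement aspect mentioned in the abstract.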

We thank the anonymous reviewers for their valuable comments. This work was supported by the Natural Science Foundation of Tianjin, China (No. 22JCQNJC01580, 22JCJQJC00150), the Fundamental Research Funds for the Central Universities (No. 63231149), and the National Natural Science Foundation of China (No. 62272250).



Author information

Corresponding author: Ying Zhang



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Yu, S., Guo, W., Zhang, Y., Yuan, X. (2023). CD-BLI: Confidence-Based Dual Refinement for Unsupervised Bilingual Lexicon Induction. In: Liu, F., Duan, N., Xu, Q., Hong, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2023. Lecture Notes in Computer Science, vol. 14303. Springer, Cham. https://doi.org/10.1007/978-3-031-44696-2_30


  • DOI: https://doi.org/10.1007/978-3-031-44696-2_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-44695-5

  • Online ISBN: 978-3-031-44696-2

  • eBook Packages: Computer Science, Computer Science (R0)
