
CD-BLI: Confidence-Based Dual Refinement for Unsupervised Bilingual Lexicon Induction

  • Conference paper
  • In: Natural Language Processing and Chinese Computing (NLPCC 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14303)

Abstract

Unsupervised bilingual lexicon induction is a crucial and challenging task in multilingual NLP that aims to induce word translations by aligning monolingual word embeddings. Existing works treat all word pairs equally and ignore the credibility of individual pairs; as a result, they are limited to global operations on static word embeddings and fail when applied to pre-trained language models. To address this problem, we propose confidence-based dual refinement for unsupervised bilingual lexicon induction, where embeddings are refined from two aspects (static word embeddings and pre-trained models) based on the confidence of word pairs, i.e., the credibility that a word pair is correctly aligned. For static word embeddings, instead of a global operation, we compute personalized mappings for different words based on confidence. For pre-trained language models, we fine-tune the model with positive and negative samples generated according to confidence. Finally, we combine the outputs of both aspects as the final result. Extensive experiments on public datasets, covering both rich-resource and low-resource languages, demonstrate the superiority of our proposal.
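The abstract only summarizes the method, so the following minimal sketch (Python with NumPy) illustrates the general idea of scoring candidate word pairs by confidence and letting that confidence steer the refinement of static embeddings. It is not the paper's implementation: the CSLS-style margin used as a confidence score and the single confidence-weighted orthogonal mapping are generic stand-ins for the personalized per-word mappings described above, and all function names are hypothetical.

import numpy as np

def csls_confidence(X, Y, pairs, k=10):
    # X, Y: row-normalised source/target embedding matrices (n_src x d, n_tgt x d).
    # pairs: list of (i, j) candidate translation pairs.
    # Confidence = CSLS-style margin: pair similarity minus the average
    # similarity of each word to its k nearest neighbours on the other side.
    sims = X @ Y.T
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1)   # hubness penalty, source side
    r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0)   # hubness penalty, target side
    return np.array([2 * sims[i, j] - r_src[i] - r_tgt[j] for i, j in pairs])

def confidence_weighted_map(X, Y, pairs, conf):
    # Orthogonal map W minimising sum_i conf_i * ||W x_i - y_i||^2 (closed form via SVD).
    # High-confidence pairs pull the alignment more strongly; apply as X @ W.T.
    w = np.clip(conf, 0.0, None)
    Xs = np.stack([X[i] for i, _ in pairs]) * w[:, None]
    Ys = np.stack([Y[j] for _, j in pairs])
    U, _, Vt = np.linalg.svd(Ys.T @ Xs)
    return U @ Vt

In the same spirit, the confidence scores could be used to select positive and negative pairs for fine-tuning a pre-trained language model, the second refinement aspect mentioned in the abstract.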

We thank the anonymous reviewers for their valuable comments. This work was supported by the Natural Science Foundation of Tianjin, China (No. 22JCQNJC01580, 22JCJQJC00150), the Fundamental Research Funds for the Central Universities (No. 63231149), and the National Natural Science Foundation of China (No. 62272250).



Author information

Corresponding author: Ying Zhang



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Yu, S., Guo, W., Zhang, Y., Yuan, X. (2023). CD-BLI: Confidence-Based Dual Refinement for Unsupervised Bilingual Lexicon Induction. In: Liu, F., Duan, N., Xu, Q., Hong, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2023. Lecture Notes in Computer Science, vol. 14303. Springer, Cham. https://doi.org/10.1007/978-3-031-44696-2_30


  • DOI: https://doi.org/10.1007/978-3-031-44696-2_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-44695-5

  • Online ISBN: 978-3-031-44696-2

  • eBook Packages: Computer Science, Computer Science (R0)
