Bilingual Lexicon Induction From Comparable and Parallel Data: A Comparative Analysis

  • Conference paper
  • First Online:
Text, Speech, and Dialogue (TSD 2024)

Abstract

Bilingual lexicon induction (BLI) from comparable data has become a common way of evaluating cross-lingual word embeddings (CWEs). These models have drawn much attention, mainly because they are available for rare and low-resource language pairs. An alternative is offered by systems exploiting parallel data, such as popular neural machine translation systems (NMTSs), which are effective and yield state-of-the-art results. Despite the significant advancements in NMTSs, their effectiveness on the BLI task compared to models using comparable data remains underexplored. In this paper, we provide a comparative study of NMTS and CWE models evaluated on the BLI task and report results across three diverse language pairs: a distant pair (Estonian-English), a closely related pair (Estonian-Finnish), and a pair with different scripts (Estonian-Russian). Our study reveals the differences, strengths, and limitations of both approaches. We show that while NMTSs achieve impressive results for languages with a large amount of training data available, CWEs emerge as the better option when fewer resources are available.


Notes

  1. https://github.com/x-mia/marianmt-bli.

  2. Hubness is an issue observed in high-dimensional spaces where some points are the nearest neighbours of many other points [17]; see the sketch after these notes.

  3. https://marian-nmt.github.io/.

  4. https://opus.nlpl.eu/.

  5. http://www.eki.ee/dict/ies/.

  6. http://www.eki.ee/dict/efi/.

  7. https://portaal.eki.ee/dict/evs/.

  8. https://creativecommons.org/licenses/by/4.0/.

  9. https://huggingface.co/.
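
As a quick illustration of the hubness effect mentioned in note 2, the following minimal sketch (not part of the paper; the data and parameters are made up) counts how often each point in a random high-dimensional set appears among the k nearest neighbours of the other points. A strongly right-skewed count distribution signals hubs.

```python
# Minimal, hypothetical sketch of the hubness effect (note 2): in high-dimensional
# spaces, a few "hub" points become nearest neighbours of many other points.
import numpy as np

rng = np.random.default_rng(0)
n, dim, k = 2000, 300, 10                        # toy corpus size, fastText-style dimensionality

X = rng.normal(size=(n, dim))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # unit-normalise so the dot product is cosine similarity

sims = X @ X.T
np.fill_diagonal(sims, -np.inf)                  # a point is never its own neighbour
neighbours = np.argsort(-sims, axis=1)[:, :k]    # k nearest neighbours of every point

# k-occurrence: how often each point shows up in other points' neighbour lists.
k_occurrence = np.bincount(neighbours.ravel(), minlength=n)

print("max k-occurrence:", int(k_occurrence.max()))
print("points that are nobody's neighbour:", int((k_occurrence == 0).sum()))
```

Such hubs distort nearest-neighbour retrieval in a shared embedding space, which is why BLI work often replaces plain cosine retrieval with hubness-aware criteria such as the CSLS criterion of Conneau et al. [5].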

References

  1. Artetxe, M., Labaka, G., Agirre, E.: Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2289–2294. Association for Computational Linguistics (2016). https://doi.org/10.18653/v1/D16-1250

  2. Artetxe, M., Labaka, G., Agirre, E.: Generalizing and improving bilingual word embedding mappings with a multi-step framework of linear transformations. In: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pp. 5012–5019 (2018). https://doi.org/10.1609/aaai.v32i1.11992

  3. Artetxe, M., Labaka, G., Agirre, E.: A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 789–798. Association for Computational Linguistics (2018). https://doi.org/10.18653/v1/P18-1073

  4. Artetxe, M., Labaka, G., Agirre, E.: Unsupervised statistical machine translation. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3632–3642. Association for Computational Linguistics (2018). https://doi.org/10.18653/v1/D18-1399

  5. Conneau, A., Lample, G., Ranzato, M., Denoyer, L., Jégou, H.: Word translation without parallel data. arXiv abs/1710.04087 (2017). https://doi.org/10.48550/arXiv.1710.04087

  6. Denisová, M.: Parallel, or comparable? That is the question: the comparison of parallel and comparable data-based methods for bilingual lexicon induction. In: Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2022, pp. 3–13. Tribun EU (2022)


  7. Denisová, M., Rychlý, P.: When word pairs matter: analysis of the English-Slovak evaluation dataset. In: Recent Advances in Slavonic Natural Language Processing (RASLAN 2021), pp. 141–149. Tribun EU, Brno (2021)


  8. Duan, X., et al.: Bilingual dictionary based neural machine translation without using parallel sentences. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 1570–1579. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.acl-main.143

  9. Glavaš, G., Litschko, R., Ruder, S., Vulić, I.: How to (properly) evaluate cross-lingual word embeddings: on strong baselines, comparative analyses, and some misconceptions. In: Korhonen, A., Traum, D., Màrquez, L. (eds.) Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 710–721. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/P19-1070

  10. Grave, E., Bojanowski, P., Gupta, P., Joulin, A., Mikolov, T.: Learning word vectors for 157 languages. In: Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018) (2018)


  11. Joulin, A., Bojanowski, P., Mikolov, T., Jégou, H., Grave, E.: Loss in translation: learning bilingual word mapping with a retrieval criterion. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2979–2984. Association for Computational Linguistics (2018). https://doi.org/10.18653/v1/D18-1330

  12. Kementchedjhieva, Y., Hartmann, M., Søgaard, A.: Lost in evaluation: misleading benchmarks for bilingual dictionary induction. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3336–3341. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/D19-1328

  13. Koehn, P., Knight, K.: Learning a translation lexicon from monolingual corpora. In: Proceedings of the ACL-02 Workshop on Unsupervised Lexical Acquisition, pp. 9–16. Association for Computational Linguistics (2002). https://doi.org/10.3115/1118627.1118629

  14. Kovář, V., Baisa, V., Jakubíček, M.: Sketch engine for bilingual lexicography. Int. J. Lexicogr. 29(3), 339–352 (2016). https://doi.org/10.1093/ijl/ecw029

  15. Li, Y., Korhonen, A., Vulić, I.: On bilingual lexicon induction with large language models. In: Bouamor, H., Pino, J., Bali, K. (eds.) Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 9577–9599. Association for Computational Linguistics (2023). https://doi.org/10.18653/v1/2023.emnlp-main.595

  16. Mikolov, T., Le, Q.V., Sutskever, I.: Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168 (2013)

  17. Radovanović, M., Nanopoulos, A., Ivanović, M.: Hubs in space: popular nearest neighbors in high-dimensional data. J. Mach. Learn. Res. 11, 2487–2531 (2010). https://doi.org/10.5555/1756006.1953015


  18. Ruder, S., Vulić, I., Søgaard, A.: A survey of cross-lingual word embedding models. J. Artif. Intell. Res. 65, 569–631 (2019). https://doi.org/10.1613/jair.1.11640


  19. Tiedemann, J.: News from OPUS – a collection of multilingual parallel corpora with tools and interfaces. In: Recent Advances in Natural Language Processing, vol. V, pp. 237–248 (2009)


  20. Tiedemann, J., Thottingal, S.: OPUS-MT – building open translation services for the world. In: Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, pp. 479–480. European Association for Machine Translation, Lisboa, Portugal (2020)


  21. Vulić, I., Moens, M.F.: Monolingual and cross-lingual information retrieval models based on (bilingual) word embeddings. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 363–372 (2015). https://doi.org/10.1145/2766462.2767752

  22. Yuan, M., Zhang, M., Van Durme, B., Findlater, L., Boyd-Graber, J.: Interactive refinement of cross-lingual word embeddings. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 5984–5996. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.emnlp-main.482


Author information

Correspondence to Michaela Denisová or Pavel Rychlý.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Denisová, M., Rychlý, P. (2024). Bilingual Lexicon Induction From Comparable and Parallel Data: A Comparative Analysis. In: Nöth, E., Horák, A., Sojka, P. (eds) Text, Speech, and Dialogue. TSD 2024. Lecture Notes in Computer Science, vol 15048. Springer, Cham. https://doi.org/10.1007/978-3-031-70563-2_3

  • DOI: https://doi.org/10.1007/978-3-031-70563-2_3

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-70562-5

  • Online ISBN: 978-3-031-70563-2

  • eBook Packages: Computer Science, Computer Science (R0)
