Abstract
Bilingual lexicon induction (BLI) from comparable data has become a common way of evaluating cross-lingual word embeddings (CWEs). These models have drawn much attention, mainly because they are available for rare and low-resource language pairs. An alternative is offered by systems exploiting parallel data, such as the popular neural machine translation systems (NMTSs), which are effective and yield state-of-the-art results. Despite the significant advancements in NMTSs, their effectiveness on the BLI task compared to models using comparable data remains underexplored. In this paper, we provide a comparative study of NMTS and CWE models evaluated on the BLI task and present results across three diverse language pairs: a distant pair (Estonian-English), a closely related pair (Estonian-Finnish), and a pair with different scripts (Estonian-Russian). Our study reveals the differences, strengths, and limitations of both approaches. We show that while NMTSs achieve impressive results for languages with a large amount of training data available, CWEs emerge as the better option when fewer resources are available.
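As background on how CWE-based BLI is typically evaluated: a linear map is learned from a small seed dictionary and translations are retrieved by nearest-neighbour search in the shared space. Below is a minimal, self-contained sketch with toy, hypothetical embeddings (not the paper's data); the target space is constructed as an exact rotation of the source space so the orthogonal Procrustes mapping is recoverable:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 3

# Toy monolingual embeddings (hypothetical data): target vectors are a fixed
# rotation Q of the source vectors, so a perfect linear mapping exists.
Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
pairs = [("koer", "dog"), ("kass", "cat"), ("maja", "house"), ("puu", "tree")]
src = {s: rng.standard_normal(dim) for s, _ in pairs}
tgt = {t: Q @ src[s] for s, t in pairs}

# Learn W from a seed dictionary via orthogonal Procrustes:
# W = U V^T, where U S V^T is the SVD of Y^T X.
seed = pairs[:3]                              # held-out test pair: ("puu", "tree")
X = np.stack([src[s] for s, _ in seed])
Y = np.stack([tgt[t] for _, t in seed])
U, _, Vt = np.linalg.svd(Y.T @ X)
W = U @ Vt

def translate(word):
    """Map a source word into target space; return the nearest target word by cosine."""
    v = W @ src[word]
    best, best_sim = None, -np.inf
    for t, u in tgt.items():
        sim = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
        if sim > best_sim:
            best, best_sim = t, sim
    return best

print(translate("puu"))  # held-out pair: retrieves "tree"
```

The retrieval step uses plain cosine nearest neighbours; actual BLI systems often replace it with CSLS-style criteria to mitigate hubness (see Note 2).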
Notes
- 2.
Hubness is an issue observed in high-dimensional spaces where some points are the nearest neighbours of many other points [17].
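The hubness phenomenon described in Note 2 can be illustrated with a quick standalone experiment on random unit vectors (not the paper's embeddings): even with uniformly random data, a few points become the nearest neighbour of disproportionately many others, while many points are nobody's nearest neighbour.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 300                      # typical word-embedding dimensionality
points = rng.standard_normal((1000, dim))
points /= np.linalg.norm(points, axis=1, keepdims=True)  # unit-normalise

# Cosine similarity of unit vectors = dot product.
sims = points @ points.T
np.fill_diagonal(sims, -np.inf)         # exclude self-matches
nearest = sims.argmax(axis=1)           # 1-NN index for every point

# Count how often each point is someone else's nearest neighbour.
hub_counts = np.bincount(nearest, minlength=len(points))
print("max times a single point is a 1-NN:", hub_counts.max())
print("points that are never anyone's 1-NN:", (hub_counts == 0).sum())
```

This skew is what degrades naive nearest-neighbour retrieval in BLI and motivates hubness-corrected criteria such as CSLS.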
References
Artetxe, M., Labaka, G., Agirre, E.: Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2289–2294. Association for Computational Linguistics (2016). https://doi.org/10.18653/v1/D16-1250
Artetxe, M., Labaka, G., Agirre, E.: Generalizing and improving bilingual word embedding mappings with a multi-step framework of linear transformations. In: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pp. 5012–5019 (2018). https://doi.org/10.1609/aaai.v32i1.11992
Artetxe, M., Labaka, G., Agirre, E.: A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 789–798. Association for Computational Linguistics (2018). https://doi.org/10.18653/v1/P18-1073
Artetxe, M., Labaka, G., Agirre, E.: Unsupervised statistical machine translation. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3632–3642. Association for Computational Linguistics (2018). https://doi.org/10.18653/v1/D18-1399
Conneau, A., Lample, G., Ranzato, M., Denoyer, L., Jégou, H.: Word translation without parallel data. arXiv abs/1710.04087 (2017). https://doi.org/10.48550/arXiv.1710.04087
Denisová, M.: Parallel, or comparable? That is the question: the comparison of parallel and comparable data-based methods for bilingual lexicon induction. In: Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Languages Processing. RASLAN 2022, pp. 3–13. Tribun EU (2022)
Denisová, M., Rychlý, P.: When word pairs matter: analysis of the English-Slovak evaluation dataset. In: Recent Advances in Slavonic Natural Language Processing (RASLAN 2021), pp. 141–149. Tribun EU, Brno (2021)
Duan, X., et al.: Bilingual dictionary based neural machine translation without using parallel sentences. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 1570–1579. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.acl-main.143
Glavaš, G., Litschko, R., Ruder, S., Vulić, I.: How to (properly) evaluate cross-lingual word embeddings: on strong baselines, comparative analyses, and some misconceptions. In: Korhonen, A., Traum, D., Màrquez, L. (eds.) Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 710–721. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/P19-1070
Grave, E., Bojanowski, P., Gupta, P., Joulin, A., Mikolov, T.: Learning word vectors for 157 languages. In: Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018) (2018)
Joulin, A., Bojanowski, P., Mikolov, T., Jégou, H., Grave, E.: Loss in translation: learning bilingual word mapping with a retrieval criterion. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2979–2984. Association for Computational Linguistics (2018). https://doi.org/10.18653/v1/D18-1330
Kementchedjhieva, Y., Hartmann, M., Søgaard, A.: Lost in evaluation: misleading benchmarks for bilingual dictionary induction. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3336–3341. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/D19-1328
Koehn, P., Knight, K.: Learning a translation lexicon from monolingual corpora. In: Proceedings of the ACL-02 Workshop on Unsupervised Lexical Acquisition, pp. 9–16. Association for Computational Linguistics (2002). https://doi.org/10.3115/1118627.1118629
Kovář, V., Baisa, V., Jakubíček, M.: Sketch engine for bilingual lexicography. Int. J. Lexicogr. 29(3), 339–352 (2016). https://doi.org/10.1093/ijl/ecw029
Li, Y., Korhonen, A., Vulić, I.: On bilingual lexicon induction with large language models. In: Bouamor, H., Pino, J., Bali, K. (eds.) Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 9577–9599. Association for Computational Linguistics (2023). https://doi.org/10.18653/v1/2023.emnlp-main.595
Mikolov, T., Le, Q.V., Sutskever, I.: Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168 (2013)
Radovanović, M., Nanopoulos, A., Ivanović, M.: Hubs in space: popular nearest neighbors in high-dimensional data. J. Mach. Learn. Res. 11, 2487–2531 (2010). https://doi.org/10.5555/1756006.1953015
Ruder, S., Vulić, I., Søgaard, A.: A survey of cross-lingual word embedding models. J. Artif. Intell. Res. 65, 569–631 (2019). https://doi.org/10.1613/jair.1.11640
Tiedemann, J.: News from OPUS - a collection of multilingual parallel corpora with tools and interfaces. In: Recent Advances in Natural Language Processing, vol. V, pp. 237–248 (2009)
Tiedemann, J., Thottingal, S.: OPUS-MT – building open translation services for the world. In: Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, pp. 479–480. European Association for Machine Translation, Lisboa, Portugal (2020)
Vulić, I., Moens, M.F.: Monolingual and cross-lingual information retrieval models based on (bilingual) word embeddings. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 363–372 (2015). https://doi.org/10.1145/2766462.2767752
Yuan, M., Zhang, M., Van Durme, B., Findlater, L., Boyd-Graber, J.: Interactive refinement of cross-lingual word embeddings. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 5984–5996. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.emnlp-main.482
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Denisová, M., Rychlý, P. (2024). Bilingual Lexicon Induction From Comparable and Parallel Data: A Comparative Analysis. In: Nöth, E., Horák, A., Sojka, P. (eds) Text, Speech, and Dialogue. TSD 2024. Lecture Notes in Computer Science, vol 15048. Springer, Cham. https://doi.org/10.1007/978-3-031-70563-2_3
DOI: https://doi.org/10.1007/978-3-031-70563-2_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70562-5
Online ISBN: 978-3-031-70563-2