Abstract
Lexical similarity data, which quantifies the “proximity” of languages based on the similarity of their lexicons, has been increasingly used to estimate the cross-lingual reusability of language resources for tasks such as bilingual lexicon induction or cross-lingual transfer. Existing similarity data, however, originates from the field of comparative linguistics and is computed from very small expert-curated vocabularies that are not meant to be representative of modern lexicons. We explore a different, fully automated approach to lexical similarity computation, based on an existing 8-million-entry cognate database created from online lexicons that are orders of magnitude larger than the word lists typically used in linguistics. We compare our results to earlier efforts, and automatically produce intuitive visualizations of the kind that have traditionally been hand-crafted. With a new, freely available database of over 27 thousand language pairs covering 331 languages, we hope to provide more relevant data to cross-lingual NLP applications, as well as material for the synchronic study of contemporary lexicons.
Notes
- 7. The full graphs are visible on the page http://ukc.datascientia.eu/lexdist.
Acknowledgments
This paper was partly supported by the InteropEHRate project, co-funded by the European Union (EU) Horizon 2020 programme under grant number 826106.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Bella, G., Batsuren, K., Giunchiglia, F. (2021). A Database and Visualization of the Similarity of Contemporary Lexicons. In: Ekštein, K., Pártl, F., Konopík, M. (eds) Text, Speech, and Dialogue. TSD 2021. Lecture Notes in Computer Science, vol 12848. Springer, Cham. https://doi.org/10.1007/978-3-030-83527-9_8
DOI: https://doi.org/10.1007/978-3-030-83527-9_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-83526-2
Online ISBN: 978-3-030-83527-9
eBook Packages: Computer Science (R0)