Abstract
This paper presents a new dataset of noun-adjective multiword expressions (MWEs) in Galician with different degrees of compositionality and semantic ambiguity. It comprises 240 MWEs, each of which conveys one or more senses depending on the context. For each sense, a language expert manually created two sentences and selected from corpora four additional examples that include the target MWE. This combination of newly written and corpus-attested sentences makes the resource useful for exploring potential data contamination when evaluating language models. Each MWE in context was then classified as idiomatic, partially idiomatic, or compositional. The dataset thus comprises MWEs with stable meanings and two types of ambiguous expressions: 1) potentially idiomatic expressions (e.g., red flag), and 2) polysemy-based ambiguous MWEs, whose different senses arise from the ambiguity of one of the constituent words (e.g., common noun, either a type of noun or a noun that is common). To illustrate the potential of this resource, three BERT models for Galician were compared, shedding light on how Transformers represent ambiguous MWEs. The dataset is a valuable resource for evaluating the semantic capabilities of current language models in a low-resource variety, bearing in mind that idiomaticity is among the linguistic phenomena that are hardest to model computationally.
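The organisation described in the abstract (an MWE with one or more senses, six sentences per sense, and a three-way label for each MWE in context) can be pictured with a small data model. The following is only an illustrative sketch: the class and field names (`MWE`, `Sense`, `Example`, `gloss`, `is_complete`, `is_ambiguous`) are assumptions made here and do not reflect the actual file format of the released dataset.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Label(Enum):
    # The three classes named in the abstract.
    IDIOMATIC = "idiomatic"
    PARTIALLY_IDIOMATIC = "partially idiomatic"
    COMPOSITIONAL = "compositional"


class Source(Enum):
    CREATED = "created"  # sentence written by the language expert
    CORPUS = "corpus"    # sentence selected from corpora


@dataclass
class Example:
    sentence: str
    source: Source
    label: Label  # classification of the MWE in this context


@dataclass
class Sense:
    gloss: str  # hypothetical field: short description of the sense
    examples: List[Example] = field(default_factory=list)

    def is_complete(self) -> bool:
        """True if the sense has two created and four corpus sentences."""
        created = sum(e.source is Source.CREATED for e in self.examples)
        corpus = sum(e.source is Source.CORPUS for e in self.examples)
        return created == 2 and corpus == 4


@dataclass
class MWE:
    expression: str
    senses: List[Sense] = field(default_factory=list)

    def is_ambiguous(self) -> bool:
        """True if the expression conveys more than one sense."""
        return len(self.senses) > 1
```

Under this sketch, an expression with a single sense would count as having a stable meaning, while one with several senses would be ambiguous. Keeping the `source` of each sentence explicit allows created and corpus sentences to be evaluated separately, which is what a contamination analysis would need.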
Notes
1. The dataset can be found at https://github.com/Castro-L/MWE_dataset_gl.
Acknowledgement
This work was funded by MCIN/AEI/10.13039/501100011033 (grants with references PID 2021-128811OA-I00, PRE2022-102762, and TED 2021-130295B-C33, the latter also funded by “European Union Next Generation EU/PRTR”), by the Galician Government (ERDF 2014–2020: Call ED431G 2019/04, and ED431F 2021/01), and by a Ramón y Cajal grant (RYC2019-028473-I).
The authors have no competing interests to declare that are relevant to the content of this article.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Castro, L., Temerko, A., Garcia, M. (2025). Compositionality and Ambiguity in Multiword Expressions: A Dataset for the Evaluation of Language Models in Galician. In: Santos, M.F., Machado, J., Novais, P., Cortez, P., Moreira, P.M. (eds) Progress in Artificial Intelligence. EPIA 2024. Lecture Notes in Computer Science, vol 14969. Springer, Cham. https://doi.org/10.1007/978-3-031-73503-5_19
DOI: https://doi.org/10.1007/978-3-031-73503-5_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73502-8
Online ISBN: 978-3-031-73503-5
eBook Packages: Computer Science (R0)