Compositionality and Ambiguity in Multiword Expressions: A Dataset for the Evaluation of Language Models in Galician

  • Conference paper
Progress in Artificial Intelligence (EPIA 2024)

Abstract

This paper presents a new dataset of noun-adjective multiword expressions (MWEs) with different degrees of compositionality and semantic ambiguity in Galician. It is composed of 240 MWEs, each of which can convey one or more senses depending on the context. For each sense, a language expert manually created two sentences and selected from corpora four additional examples containing the target MWE; the contrast between newly written and corpus-attested sentences makes the dataset a useful resource for exploring potential data contamination when evaluating language models. Each MWE in context was then classified as idiomatic, partially idiomatic, or compositional. The dataset thus comprises MWEs with stable meanings and two types of ambiguous expressions: 1) potentially idiomatic expressions (e.g., red flag), and 2) polysemy-based ambiguous MWEs, whose senses differ because one of the constituent words is ambiguous (e.g., common noun, read either as a type of noun or as a noun that is common). To illustrate the potential of this resource, three BERT models for Galician were compared, shedding light on how Transformers represent ambiguous MWEs. The dataset is a valuable resource for evaluating the semantic capabilities of current language models in a low-resource variety, bearing in mind that idiomaticity is one of the linguistic phenomena whose modeling poses the greatest challenges for computational approaches.
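The organisation described in the abstract (MWE, sense, six sentences per sense, one label per sentence in context) could be represented roughly as follows. This is only a sketch under stated assumptions: the class and field names (`Example`, `sense_id`, `Source`, and so on) are hypothetical and do not reflect the actual file format of the released dataset, and the toy records use placeholder sentences rather than real data.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    # The three classes named in the abstract.
    IDIOMATIC = "idiomatic"
    PARTIALLY_IDIOMATIC = "partially idiomatic"
    COMPOSITIONAL = "compositional"


class Source(Enum):
    # Hypothetical field: how the sentence was obtained.
    CREATED = "created"  # written by the language expert
    CORPUS = "corpus"    # selected from corpora


@dataclass(frozen=True)
class Example:
    mwe: str        # target multiword expression
    sense_id: int   # hypothetical identifier of the sense within the MWE
    sentence: str   # sentence containing the MWE
    label: Label
    source: Source


def is_ambiguous(examples, mwe):
    """Return True if the MWE occurs with more than one sense."""
    return len({e.sense_id for e in examples if e.mwe == mwe}) > 1


def source_counts(examples, mwe, sense_id):
    """Count created vs. corpus sentences for one sense of one MWE."""
    return Counter(
        e.source for e in examples if e.mwe == mwe and e.sense_id == sense_id
    )


def make_sense(mwe, sense_id, label):
    """Build placeholder records: two created and four corpus sentences."""
    created = [
        Example(mwe, sense_id, f"<created sentence {i}>", label, Source.CREATED)
        for i in range(2)
    ]
    corpus = [
        Example(mwe, sense_id, f"<corpus sentence {i}>", label, Source.CORPUS)
        for i in range(4)
    ]
    return created + corpus


# Toy data only: "red flag" is the abstract's example of a potentially
# idiomatic expression; the sentences are placeholders, not dataset content.
toy = make_sense("red flag", 1, Label.IDIOMATIC) + make_sense(
    "red flag", 2, Label.COMPOSITIONAL
)
```

Keeping the sentence source as an explicit field makes it straightforward to evaluate a model separately on expert-written and corpus-attested sentences, which is the kind of comparison a contamination check would need.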

Notes

  1. The dataset can be found at https://github.com/Castro-L/MWE_dataset_gl.

Acknowledgement

This work was funded by MCIN/AEI/10.13039/501100011033 (grants with references PID 2021-128811OA-I00, PRE2022-102762, and TED 2021-130295B-C33, the latter also funded by “European Union Next Generation EU/PRTR”), by the Galician Government (ERDF 2014–2020: Call ED431G 2019/04, and ED431F 2021/01), and by a Ramón y Cajal grant (RYC2019-028473-I).

The authors have no competing interests to declare that are relevant to the content of this article.

Author information

Corresponding author

Correspondence to Laura Castro.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Castro, L., Temerko, A., Garcia, M. (2025). Compositionality and Ambiguity in Multiword Expressions: A Dataset for the Evaluation of Language Models in Galician. In: Santos, M.F., Machado, J., Novais, P., Cortez, P., Moreira, P.M. (eds) Progress in Artificial Intelligence. EPIA 2024. Lecture Notes in Computer Science, vol 14969. Springer, Cham. https://doi.org/10.1007/978-3-031-73503-5_19

  • DOI: https://doi.org/10.1007/978-3-031-73503-5_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73502-8

  • Online ISBN: 978-3-031-73503-5

  • eBook Packages: Computer Science, Computer Science (R0)
