Compositionality and Ambiguity in Multiword Expressions: A Dataset for the Evaluation of Language Models in Galician

Castro, Laura; Temerko, Anna; Garcia, Marcos

doi:10.1007/978-3-031-73503-5_19

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14969)

Included in the following conference series:

EPIA Conference on Artificial Intelligence

Abstract

This paper presents a new dataset of noun-adjective multiword expressions (MWEs) in Galician with different degrees of compositionality and semantic ambiguity. It comprises 240 MWEs, each of which can convey one or several senses depending on the context. For each sense, a language expert manually created two sentences and selected from corpora four additional examples that include the target MWE, which makes the resource useful for exploring potential data contamination when evaluating language models. Each MWE in context was then classified as idiomatic, partially idiomatic, or compositional. The dataset therefore comprises MWEs with stable meanings and two types of ambiguous expressions: 1) potentially idiomatic expressions (e.g., red flag), and 2) polysemy-based ambiguous MWEs, whose senses stem from the ambiguity of one of the constituent words (e.g., common noun, read either as a type of noun or as a noun that is common). To illustrate the potential of this resource, three BERT models for Galician were compared, shedding light on how ambiguous MWEs are represented in Transformers. The dataset is a valuable resource for evaluating the semantic capabilities of current language models in a low-resource variety, bearing in mind that idiomaticity is among the linguistic phenomena that are hardest to model computationally.
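
The organisation described in the abstract can be sketched as a small record type. This is only an illustration under stated assumptions: the class `Example`, its field names, the helper functions, and the two English toy sentences are hypothetical and are not taken from the released dataset, whose actual file format is not described here.

```python
from collections import defaultdict
from dataclasses import dataclass

# The three labels the abstract names for an MWE in context.
LABELS = {"idiomatic", "partially idiomatic", "compositional"}


@dataclass(frozen=True)
class Example:
    """One sentence containing a target MWE (hypothetical schema)."""
    mwe: str       # target expression
    sense: str     # hypothetical sense identifier
    sentence: str  # context containing the MWE
    source: str    # "created" (written by the expert) or "corpus"
    label: str     # one of LABELS


def labels_by_mwe(examples):
    """Map each MWE to the set of labels observed across its contexts."""
    out = defaultdict(set)
    for ex in examples:
        if ex.label not in LABELS:
            raise ValueError(f"unknown label: {ex.label!r}")
        out[ex.mwe].add(ex.label)
    return dict(out)


def split_by_source(examples):
    """Separate expert-written sentences from corpus-attested ones."""
    created = [ex for ex in examples if ex.source == "created"]
    corpus = [ex for ex in examples if ex.source == "corpus"]
    return created, corpus


# Toy English records, invented for illustration (not dataset content).
toy = [
    Example("red flag", "warning", "The missing receipts were a red flag.",
            "created", "idiomatic"),
    Example("red flag", "flag", "A red flag hung from the balcony.",
            "corpus", "compositional"),
]

# MWEs whose label changes with context, i.e. potentially idiomatic ones.
ambiguous = {m for m, labels in labels_by_mwe(toy).items() if len(labels) > 1}
created, corpus = split_by_source(toy)
```

Keeping the `source` field separate is what would allow a model's behaviour on expert-written sentences to be compared with its behaviour on corpus sentences it may have seen during pre-training. Note that grouping by label alone would not detect polysemy-based ambiguity when two senses share the same label; for that, the `sense` field would have to be compared instead.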

Notes

1. The dataset can be found at https://github.com/Castro-L/MWE_dataset_gl.

Acknowledgement

This work was funded by MCIN/AEI/10.13039/501100011033 (grants with references PID 2021-128811OA-I00, PRE2022-102762, and TED 2021-130295B-C33, the latter also funded by “European Union Next Generation EU/PRTR”), by the Galician Government (ERDF 2014–2020: Call ED431G 2019/04, and ED431F 2021/01), and by a Ramón y Cajal grant (RYC2019-028473-I).

The authors have no competing interests to declare that are relevant to the content of this article.

Author information

Authors and Affiliations

Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela, Santiago de Compostela, Galicia, Spain
Laura Castro, Anna Temerko & Marcos Garcia

Authors

Laura Castro
Anna Temerko
Marcos Garcia

Corresponding author

Correspondence to Laura Castro.

Editor information

Editors and Affiliations

University of Minho, Braga, Portugal
Manuel Filipe Santos
University of Minho, Braga, Portugal
José Machado
University of Minho, Braga, Portugal
Paulo Novais
University of Minho, Braga, Portugal
Paulo Cortez
Polytechnic Institute of Viana do Castelo, Viana do Castelo, Portugal
Pedro Miguel Moreira

About this paper

Cite this paper

Castro, L., Temerko, A., Garcia, M. (2025). Compositionality and Ambiguity in Multiword Expressions: A Dataset for the Evaluation of Language Models in Galician. In: Santos, M.F., Machado, J., Novais, P., Cortez, P., Moreira, P.M. (eds) Progress in Artificial Intelligence. EPIA 2024. Lecture Notes in Computer Science, vol 14969. Springer, Cham. https://doi.org/10.1007/978-3-031-73503-5_19

DOI: https://doi.org/10.1007/978-3-031-73503-5_19
Published: 16 November 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73502-8
Online ISBN: 978-3-031-73503-5
eBook Packages: Computer Science, Computer Science (R0)
