Slovene Multi-word Units: Identification, Categorization, and Representation

Gantar, Polona; Čibej, Jaka; Bon, Mija

doi:10.1007/978-3-030-30135-4_8

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11755)

Included in the following conference series:

International Conference on Computational and Corpus-Based Phraseology

Abstract

In this paper, we present the results of a manual annotation of a Slovene training corpus with multi-word units (MWUs) relevant for inclusion in a lexicon of Slovene MWUs. We analyze the annotations in terms of (a) the frequency with which a string has been identified as an MWU, (b) the degree to which the annotators agree on the category of the identified MWU, and (c) the degree to which the annotators agree on the range of the MWU in terms of its lexicalized elements. The results of the analysis will be useful at different stages of the compilation of a Slovene MWU lexicon. The list of dictionary-relevant MWUs obtained in the annotation task will be used to enrich the lexicon and to train models for the automatic identification of MWUs in running text. The findings will also help revise the criteria for identifying and categorizing dictionary-relevant MWUs as opposed to free phrases. In addition, they will help define more clearly the distinction between the lexicalized elements of MWUs and the more or less stable elements of their textual environment. This distinction is useful for determining the canonical forms of MWUs in the lexicon on the one hand, and their relation to variable elements and syntactic conversions on the other.

Notes

1. In this case, lexicalized elements refer to the elements that must be present in each occurrence of the MWU and must always be realized by the same lexeme.
2. For a detailed description, see Gantar et al. (2017, 2019).
3. Collocations were also excluded from the PARSEME Shared Task annotation campaign.
4. https://viri.cjvt.si/kolokacije/eng/.
5. In related work on English MWUs, these expressions are usually called compounds. See Atkins and Rundell (2008: 171) for a detailed classification.
6. This category of MWUs has also been called compound prepositions (in spite of), MWUs with syntactic function (with regard to), prepositional phrases (in bed, in jail), complex prepositions (on top of), etc. For a more detailed overview, see Gantar et al. (2019).
7. http://gigafida.net.
8. https://viri.cjvt.si/kolokacije/eng.
9. Each token in the ssj500k v2.1 corpus has a unique ID. We used IDs instead of word forms or word lemmas to join batches, to avoid introducing noise in cases where the same form or lemma occurred multiple times in the sentence.
10. The lemmatized forms, sorted in alphabetical order, were used to aggregate strings that were essentially the same but differed inflectionally, e.g. ustavno sodišče (‘constitutional court’, nominative) and ustavnega sodišča (‘constitutional court’, genitive).
11. For the sake of conciseness, each different form in the cluster is shown only once, although it may actually appear multiple times.
12. Twenty-four clusters were excluded from the analysis, either because of clustering errors (see Sect. 2.2) or because the annotator incorrectly included two MWUs in a single annotation or annotated only a single element of an otherwise correctly identified MWU.
13. In some cases, the possessive pronoun can also be lexicalized, e.g. proti svoji volji ‘against his/her/their own will’.
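
The aggregation described in note 10 can be illustrated with a minimal sketch. It is not the authors' implementation: the data layout, the function names cluster_key and cluster, and the hand-written lemmas are assumptions made for illustration only. Each annotation is reduced to its lemmas sorted alphabetically, and annotations sharing that key fall into the same cluster.

```python
from collections import defaultdict

# Hypothetical annotations: each one is a list of (word_form, lemma) pairs.
# The lemmas are supplied by hand here; in practice they would come from
# the lemmatized corpus.
annotations = [
    [("ustavno", "ustaven"), ("sodišče", "sodišče")],
    [("ustavnega", "ustaven"), ("sodišča", "sodišče")],
    [("proti", "proti"), ("svoji", "svoj"), ("volji", "volja")],
]

def cluster_key(tokens):
    """Return the lemmas of one annotation, sorted alphabetically, as a tuple."""
    return tuple(sorted(lemma for _form, lemma in tokens))

def cluster(annotations):
    """Group the surface strings of annotations that share a sorted-lemma key."""
    clusters = defaultdict(list)
    for tokens in annotations:
        surface = " ".join(form for form, _lemma in tokens)
        clusters[cluster_key(tokens)].append(surface)
    return dict(clusters)

clusters = cluster(annotations)
for key, forms in clusters.items():
    print(key, forms)
```

Python's default sort orders strings by code point rather than by Slovene collation. For building a grouping key this does not matter, since any fixed, deterministic order yields the same clusters.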

References

Arhar Holdt, Š., Gorjanc, V.: Korpus FidaPLUS: nova generacija slovenskega referenčnega korpusa. Jezik in slovstvo 52(2), 95–110 (2007)
Atkins, B.T.S., Rundell, M.: The Oxford Guide to Practical Lexicography. Oxford University Press, New York (2008)
Gantar, P.: Stalne besedne zveze v slovenščini. Založba ZRC, ZRC SAZU, Ljubljana (2007)
Gantar, P., Krek, S.: Slovene lexical database. In: Majchráková, D., Garabík, R. (eds.) Proceedings of the Natural Language Processing, Multilinguality: Sixth International Conference, Modra, Slovakia, 20–21 October 2011, pp. 72–80. Tribun EU, Brno (2011)
Gantar, P.: Leksikografski opis slovenščine v digitalnem okolju. Znanstvena založba Filozofske fakultete UL, Ljubljana (2015)
Gantar, P., Krek, S., Kuzman, T.: Verbal multiword expressions in Slovene. In: Mitkov, R. (ed.) EUROPHRAS 2017. LNCS (LNAI), vol. 10596, pp. 247–259. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-69805-2_18
Gantar, P., Colman, L., Parra Escartín, C., Martínez Alonso, H.: Multiword expressions: between lexicography and NLP. Int. J. Lexicogr. 32(2), 138–162 (2019). https://doi.org/10.1093/ijl/ecy012
Hanks, P., El Maarouf, I., Oakes, M.: Flexibility of multiword expressions and corpus pattern analysis. In: Sailer, M., Markantonatou, S. (eds.) Multiword Expressions: Insights from a Multi-lingual Perspective, pp. 93–119. Language Science Press, Berlin (2018)
Hunston, S., Francis, G.: Pattern Grammar: A Corpus-Driven Approach to the Lexical Grammar of English. John Benjamins, Amsterdam (2000)
Kosem, I., et al.: Collocations Dictionary of Modern Slovene (2018). https://viri.cjvt.si/kolokacije/eng/
Kosem, I., Krek, S., Gantar, P., Arhar Holdt, Š., Čibej, J., Laskowski, C.: Kolokacijski slovar sodobne slovenščine. In: Proceedings of the Conference on Language Technologies & Digital Humanities, Ljubljana, pp. 133–139 (2018)
Kilgarriff, A., Rychly, P., Smrz, P., Tugwell, D.: The sketch engine. Inf. Technol. 105, 116–127 (2004)
Krek, S., Gantar, P., Kosem, I., Gorjanc, V., Laskowski, C.: Baza kolokacijskega slovarja slovenskega jezika. In: Proceedings of the Conference on Language Technologies & Digital Humanities, Ljubljana, pp. 101–105 (2016)
Krek, S., et al.: Training corpus ssj500k 2.1, Slovenian language resource repository CLARIN.SI (2018). http://hdl.handle.net/11356/1181
Moon, R.: Fixed Expressions and Idioms in English. A Corpus-Based Approach. Clarendon Press, Oxford (1998)
Sag, I.A., Baldwin, T., Bond, F., Copestake, A., Flickinger, D.: Multiword expressions: a pain in the neck for NLP. In: Gelbukh, A. (ed.) Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2002), pp. 1–15 (2002)
Sinclair, J. (ed.): Looking Up: An Account of the COBUILD Project in Lexical Computing and the Development of the Collins COBUILD English Language Dictionary. Collins, London and Glasgow (1987)
Sinclair, J.: Corpus, Concordance, Collocation. Oxford University Press, Oxford (1991)
Sinclair, J.: The lexical item. In: Weigand, E. (ed.) Contrastive Lexical Semantics, pp. 1–24. John Benjamins Publishing Company, Amsterdam/Philadelphia (1998)

Acknowledgments

The study presented in this paper was conducted within the New Grammar of Modern Standard Slovene: Resource and Methods project (J6-8256), which was financially supported by the Slovenian Research Agency between 2017 and 2020. The authors also acknowledge the financial support from the Slovenian Research Agency (research core funding No. P6-0411 - Language Resources and Technologies for Slovene and No. P6-0215 - Slovene Language – Basic, Contrastive, and Applied Studies). The authors would also like to thank the annotators: Anna Maria Grego, Tjaša Jelovšek, Tajda Liplin Šerbetar, Pia Rednak, Jana Vaupotič, Zala Vidic, Karolina Zgaga, and Kaja Žvanut.

Author information

Authors and Affiliations

Faculty of Arts, University of Ljubljana, Ljubljana, Slovenia
Polona Gantar & Mija Bon
Jožef Stefan Institute, Ljubljana, Slovenia
Jaka Čibej

Corresponding authors

Correspondence to Polona Gantar, Jaka Čibej or Mija Bon.

Editor information

Editors and Affiliations

University of Malaga, Malaga, Spain
Gloria Corpas Pastor
University of Wolverhampton, Wolverhampton, UK
Ruslan Mitkov

Cite this paper

Gantar, P., Čibej, J., Bon, M. (2019). Slovene Multi-word Units: Identification, Categorization, and Representation. In: Corpas Pastor, G., Mitkov, R. (eds) Computational and Corpus-Based Phraseology. EUROPHRAS 2019. Lecture Notes in Computer Science (LNAI), vol 11755. Springer, Cham. https://doi.org/10.1007/978-3-030-30135-4_8

DOI: https://doi.org/10.1007/978-3-030-30135-4_8
Published: 18 September 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30134-7
Online ISBN: 978-3-030-30135-4
eBook Packages: Computer Science, Computer Science (R0)
