Noun Compositionality Detection Using Distributional Semantics for the Russian Language

Puzyrev, Dmitry; Shelmanov, Artem; Panchenko, Alexander; Artemova, Ekaterina

doi:10.1007/978-3-030-37334-4_20

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11832)

Included in the following conference series:

International Conference on Analysis of Images, Social Networks and Texts

Abstract

In this paper, we present the first gold-standard corpus of Russian noun compounds annotated with compositionality information. We used Universal Dependencies treebanks to collect noun compounds according to part-of-speech patterns, such as ADJ-NOUN or NOUN-NOUN, and annotated them according to the following schema: a phrase can be compositional, non-compositional, or ambiguous (i.e., depending on the context, it can be interpreted as either compositional or non-compositional). Next, we conduct a series of experiments to evaluate both unsupervised and supervised methods for predicting compositionality. To expand this manually annotated dataset with more non-compositional compounds and to streamline the annotation process, we use active learning. We show that the methods previously proposed for English not only are easily adapted to Russian but can also be exploited in an active learning paradigm, which increases the efficiency of the annotation process.
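To make the idea of unsupervised compositionality prediction concrete, here is a minimal sketch of one common distributional approach: compose the constituent vectors additively and compare the result with the vector of the whole compound by cosine similarity. This is an illustration only, not necessarily the authors' exact method. The function names and the toy three-dimensional vectors are hypothetical; in practice the vectors would come from embeddings trained on a large corpus.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def compositionality_score(compound_vec, modifier_vec, head_vec):
    """Score a two-word compound by additive composition.

    The constituent vectors are summed and compared with the vector of the
    compound treated as a single token. A higher score suggests a more
    compositional phrase. (Hypothetical helper, an assumption of this sketch.)
    """
    composed = [m + h for m, h in zip(modifier_vec, head_vec)]
    return cosine(compound_vec, composed)


# Toy vectors, invented for illustration; they are not taken from any real model.
toy_vectors = {
    "red": [1.0, 0.0, 0.0],
    "car": [0.0, 1.0, 0.0],
    "red_car": [1.0, 1.0, 0.0],  # lies along the sum of its parts
    "hot": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "hot_dog": [0.0, 0.0, 1.0],  # orthogonal to the sum of its parts
}

score_compositional = compositionality_score(
    toy_vectors["red_car"], toy_vectors["red"], toy_vectors["car"]
)
score_idiomatic = compositionality_score(
    toy_vectors["hot_dog"], toy_vectors["hot"], toy_vectors["dog"]
)
```

With these toy vectors the first phrase scores close to 1 and the second scores 0. On real data, a threshold on the score, or a supervised classifier that uses the score as a feature, would separate the annotation classes.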

Notes

1. The dataset and the code: https://github.com/slangtech/ru-comps.
2. https://universaldependencies.org.
3. https://github.com/IINemo/active_learning_toolbox.
4. https://radimrehurek.com/gensim.
5. We used a Wikipedia dump from 02.05.2019, which consists of 1,542,621 articles.
6. http://github.com/nlpub/pymystem3.

Acknowledgements

Dmitry Puzyrev and Ekaterina Artemova were supported by the framework of the HSE University Basic Research Program and Russian Academic Excellence Project “5–100”.

Author information

Authors and Affiliations

National Research University Higher School of Economics, Moscow, Russia
Dmitry Puzyrev & Ekaterina Artemova
Skolkovo Institute of Science and Technology, Moscow, Russia
Artem Shelmanov & Alexander Panchenko

Corresponding author

Correspondence to Ekaterina Artemova.

Editor information

Editors and Affiliations

RWTH Aachen University, Aachen, Germany
Wil M. P. van der Aalst
University of Ljubljana, Ljubljana, Slovenia
Vladimir Batagelj
National Research University Higher School of Economics, Moscow, Russia
Dmitry I. Ignatov
Krasovskii Institute of Mathematics and Mechanics, Yekaterinburg, Russia
Michael Khachay
National Research University Higher School of Economics, Moscow, Russia
Valentina Kuskova
University of Oslo, Oslo, Norway
Andrey Kutuzov
National Research University Higher School of Economics, Moscow, Russia
Sergei O. Kuznetsov
National Research University Higher School of Economics, Moscow, Russia
Irina A. Lomazova
Lomonosov Moscow State University, Moscow, Russia
Natalia Loukachevitch
LORIA, Vandœuvre-lès-Nancy, France
Amedeo Napoli
University of Florida, Gainesville, FL, USA
Panos M. Pardalos
Ca' Foscari University of Venice, Venice, Italy
Marcello Pelillo
National Research University Higher School of Economics, Nizhny Novgorod, Russia
Andrey V. Savchenko
Kazan Federal University, Kazan, Russia
Elena Tutubalina

About this paper

Cite this paper

Puzyrev, D., Shelmanov, A., Panchenko, A., Artemova, E. (2019). Noun Compositionality Detection Using Distributional Semantics for the Russian Language. In: van der Aalst, W., et al. Analysis of Images, Social Networks and Texts. AIST 2019. Lecture Notes in Computer Science, vol 11832. Springer, Cham. https://doi.org/10.1007/978-3-030-37334-4_20

DOI: https://doi.org/10.1007/978-3-030-37334-4_20
Published: 15 December 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-37333-7
Online ISBN: 978-3-030-37334-4
eBook Packages: Computer Science, Computer Science (R0)
