Noun Compositionality Detection Using Distributional Semantics for the Russian Language

  • Conference paper
Analysis of Images, Social Networks and Texts (AIST 2019)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11832)

Abstract

In this paper, we present the first gold-standard corpus of Russian noun compounds annotated with compositionality information. We used Universal Dependencies treebanks to collect noun compounds according to part-of-speech patterns, such as ADJ-NOUN or NOUN-NOUN, and annotated them according to the following schema: a phrase can be compositional, non-compositional, or ambiguous (i.e., depending on the context, it can be interpreted as either compositional or non-compositional). Next, we conduct a series of experiments to evaluate both unsupervised and supervised methods for predicting compositionality. To expand this manually annotated dataset with more non-compositional compounds and to streamline the annotation process, we use active learning. We show that the methods previously proposed for English are not only easily adapted for Russian but can also be exploited in an active learning paradigm, which increases the efficiency of the annotation process.
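The candidate collection step described in the abstract can be illustrated with a minimal sketch. It assumes the treebank is available as CoNLL-U text and treats two adjacent tokens as a candidate when their UPOS tags match ADJ-NOUN or NOUN-NOUN and one token is the syntactic head of the other. The adjacency and head-link conditions, the function names, and the sample sentences are illustrative assumptions, not the authors' actual extraction rules.

```python
from typing import Iterator, List, NamedTuple, Tuple

# Part-of-speech patterns named in the abstract.
PATTERNS = {("ADJ", "NOUN"), ("NOUN", "NOUN")}


class Token(NamedTuple):
    idx: int    # CoNLL-U ID column
    lemma: str  # LEMMA column
    upos: str   # UPOS column
    head: int   # HEAD column (0 = root)


def read_sentences(conllu: str) -> Iterator[List[Token]]:
    """Yield sentences from CoNLL-U text as lists of tokens."""
    sentence: List[Token] = []
    for line in conllu.splitlines():
        line = line.strip()
        if not line:  # a blank line ends a sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):  # sentence-level comment
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():  # skip multiword ranges and empty nodes
            continue
        sentence.append(Token(int(cols[0]), cols[2], cols[3], int(cols[6])))
    if sentence:
        yield sentence


def extract_candidates(conllu: str) -> List[Tuple[str, str]]:
    """Collect lemma pairs of adjacent, syntactically linked tokens (assumed heuristic)."""
    pairs = []
    for sent in read_sentences(conllu):
        for left, right in zip(sent, sent[1:]):
            linked = left.head == right.idx or right.head == left.idx
            if (left.upos, right.upos) in PATTERNS and linked:
                pairs.append((left.lemma, right.lemma))
    return pairs


def row(idx: int, form: str, lemma: str, upos: str, head: int, deprel: str) -> str:
    """Build one 10-column CoNLL-U line; unused columns are '_'."""
    return "\t".join([str(idx), form, lemma, upos, "_", "_", str(head), deprel, "_", "_"])


# Hypothetical two-sentence sample, hand-annotated for illustration only.
SAMPLE = "\n".join([
    "# text = Железная дорога закрыта",
    row(1, "Железная", "железный", "ADJ", 2, "amod"),
    row(2, "дорога", "дорога", "NOUN", 3, "nsubj:pass"),
    row(3, "закрыта", "закрыть", "VERB", 0, "root"),
    "",
    "# text = точка зрения автора",
    row(1, "точка", "точка", "NOUN", 0, "root"),
    row(2, "зрения", "зрение", "NOUN", 1, "nmod"),
    row(3, "автора", "автор", "NOUN", 2, "nmod"),
])

print(extract_candidates(SAMPLE))
# [('железный', 'дорога'), ('точка', 'зрение'), ('зрение', 'автор')]
```

Requiring a head link filters out adjacent tokens that merely happen to carry matching tags; a real pipeline would likely also apply frequency thresholds before annotation.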

Notes

  1. The dataset and the code: https://github.com/slangtech/ru-comps.

  2. https://universaldependencies.org.

  3. https://github.com/IINemo/active_learning_toolbox.

  4. https://radimrehurek.com/gensim.

  5. We used a Wikipedia dump from 02.05.2019, which consists of 1,542,621 articles.

  6. http://github.com/nlpub/pymystem3.
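The notes point to gensim and a Wikipedia dump, which suggests word embeddings trained on that corpus. This page does not state which scoring functions the paper uses, so the following is only a sketch of one unsupervised baseline common in the English-language literature: the cosine similarity between the vector of the whole compound and the sum of its constituents' vectors. The function names and the toy two-dimensional vectors are hypothetical; in practice the vectors would come from a trained embedding model in which each compound is treated as a single token.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def compositionality_score(compound: Vector, first: Vector, second: Vector) -> float:
    """Compare the compound's vector with the additive composition of its parts.

    Higher values suggest that the compound's meaning is close to the
    combined meaning of its constituents (i.e., more compositional).
    """
    composed: List[float] = [a + b for a, b in zip(first, second)]
    return cosine(compound, composed)


# Toy vectors (assumed, for illustration only).
first_word = [1.0, 0.0]
second_word = [0.0, 1.0]
transparent_compound = [1.0, 1.0]   # points the same way as the sum of the parts
opaque_compound = [1.0, -1.0]       # orthogonal to the sum of the parts

print(round(compositionality_score(transparent_compound, first_word, second_word), 6))  # 1.0
print(round(compositionality_score(opaque_compound, first_word, second_word), 6))       # 0.0
```

A threshold on this score gives an unsupervised classifier, and the same score can serve as a feature for a supervised model.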

Acknowledgements

Dmitry Puzyrev and Ekaterina Artemova were supported within the framework of the HSE University Basic Research Program and the Russian Academic Excellence Project “5–100”.

Author information

Corresponding author

Correspondence to Ekaterina Artemova.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Puzyrev, D., Shelmanov, A., Panchenko, A., Artemova, E. (2019). Noun Compositionality Detection Using Distributional Semantics for the Russian Language. In: van der Aalst, W., et al. Analysis of Images, Social Networks and Texts. AIST 2019. Lecture Notes in Computer Science, vol 11832. Springer, Cham. https://doi.org/10.1007/978-3-030-37334-4_20

  • DOI: https://doi.org/10.1007/978-3-030-37334-4_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-37333-7

  • Online ISBN: 978-3-030-37334-4

  • eBook Packages: Computer Science, Computer Science (R0)
