Multilingual Embeddings for Clustering Cultural Events

Kunilovskaya, Maria; Kuzmenko, Elizaveta

doi:10.1007/978-3-031-16500-9_8

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13217)

Included in the following conference series:

International Conference on Analysis of Images, Social Networks and Texts


Abstract

In the present paper we describe our approach to semi-automatic text annotation based on clustering. Given a large collection of announcements of cultural events from several websites, we group them based on their content and infer the respective semantic categories that can be used for annotation (e.g. lecture, sports, food, music). We experiment with various models for vectorising the texts, including pretrained multilingual Sentence Transformers and multilingual ELMo models. The resulting text embeddings are then clustered using K-means. We evaluate our clustering results using a stratified sample of texts with pre-existing categories (collected from the websites listing the events) as well as intrinsic evaluation measures. The rationale behind this work is to produce a single categorisation covering texts from various sources and in two languages, English and Russian. The labelled collection of texts is intended for use in a Digital Humanities project aimed at describing cultural life in a selected location, for example, comparing types of events in Russian and British cities.
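The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the six three-dimensional toy vectors stand in for real text embeddings (which would come from a model such as a multilingual Sentence Transformer), and the `kmeans` function, its parameters, and the choice of k = 2 are assumptions made purely for illustration.

```python
import math
import random


def kmeans(vectors, k, n_iter=50, seed=0):
    """Plain Lloyd's K-means over a list of equal-length float vectors.

    Returns (labels, centroids). Hypothetical helper written for illustration.
    """
    rng = random.Random(seed)
    # Initialise centroids with k distinct input vectors.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy "embeddings": the first three vectors are close to each other,
# as are the last three (assumed data, not taken from the paper).
toy_vectors = [
    [0.9, 0.1, 0.0],
    [1.0, 0.0, 0.1],
    [0.8, 0.2, 0.1],
    [0.0, 0.9, 1.0],
    [0.1, 1.0, 0.9],
    [0.2, 0.8, 1.0],
]

labels, centroids = kmeans(toy_vectors, k=2)
print(labels)
```

On real data the vectors would have several hundred dimensions, and a library implementation of K-means would normally be preferred over this hand-written loop.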

This work has been partly supported by the Russian Foundation for Basic Research within the project "Cultural Trends in the Tyumen Region in the National and Global Contexts", No. 20-411-720010 p_a_Tyumen region.


Notes

1. Based on the previous results, the texts from Behance were excluded: they did not contain descriptions of cultural events, but were mostly captions for graphics.
2. https://huggingface.co/sentence-transformers/stsb-xlm-r-multilingual. This model was chosen because it was fine-tuned on the semantic textual similarity task, and we want similar texts in the two languages to be placed in the same cluster.
3. https://github.com/ltgoslo/simple_elmo (model #219 in the NLPL repository, http://vectors.nlpl.eu/repository/).
4. https://stats.stackexchange.com/questions/260487/adjusted-rand-index-vs-adjusted-mutual-information.
5. Algorithm settings: stratified 5-fold cross-validation, earlystop=True, balanced=True.
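Note 4 points to a discussion of the adjusted Rand index and adjusted mutual information, two measures for comparing a clustering against known categories. As a hedged illustration of that kind of comparison, the sketch below computes the adjusted Rand index from its standard pair-counting formula. The function name `adjusted_rand_index` and the toy labels are assumptions, and nothing here should be read as the evaluation code or the results of the paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(gold, pred):
    """Adjusted Rand index between two labelings of the same items.

    Hypothetical helper: 1.0 means identical partitions, values near 0.0
    mean agreement no better than chance.
    """
    n = len(gold)
    if n < 2:
        return 1.0  # no pairs to disagree on
    # Pairs of items placed together in both labelings (contingency cells).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(gold, pred)).values())
    # Pairs placed together within each labeling separately.
    sum_gold = sum(comb(c, 2) for c in Counter(gold).values())
    sum_pred = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_gold * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = (sum_gold + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings have one cluster
    return (sum_cells - expected) / (max_index - expected)


# Toy example (assumed labels): website categories vs. cluster ids.
gold = ["music", "music", "music", "lecture", "lecture", "lecture"]
pred = [1, 1, 1, 0, 0, 0]
print(adjusted_rand_index(gold, pred))
```

The cluster ids themselves carry no meaning: only the grouping of items is compared, so swapping the ids 0 and 1 in `pred` leaves the score unchanged.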


Author information

Authors and Affiliations

University of Tyumen, Tyumen, Russia
Maria Kunilovskaya
University of Wolverhampton, Wolverhampton, UK
Maria Kunilovskaya
University of Trento, Trento, Italy
Elizaveta Kuzmenko


Corresponding author

Correspondence to Maria Kunilovskaya.

Editor information

Editors and Affiliations

Skolkovo Institute of Science and Technology, Moscow, Russia
Evgeny Burnaev
National Research University Higher School of Economics, Moscow, Russia
Dmitry I. Ignatov
Skolkovo Institute of Science and Technology, Moscow, Russia
Sergei Ivanov
Krasovskii Institute of Mathematics and Mechanics of Russian Academy of Sciences, Yekaterinburg, Russia
Michael Khachay
National Research University Higher School of Economics, St. Petersburg, Russia
Olessia Koltsova
University of Oslo, Oslo, Norway
Andrei Kutuzov
National Research University Higher School of Economics, Moscow, Russia
Sergei O. Kuznetsov
Lomonosov Moscow State University, Moscow, Russia
Natalia Loukachevitch
LORIA, Campus Scientifique, Vandœuvre lès Nancy, France
Amedeo Napoli
Skolkovo Institute of Science and Technology, Moscow, Russia
Alexander Panchenko
University of Florida, Gainesville, USA
Panos M. Pardalos
Aalto University, Espoo, Finland
Jari Saramäki
National Research University Higher School of Economics, Nizhny Novgorod, Russia
Andrey V. Savchenko
Yandex LLC, Moscow, Russia
Evgenii Tsymbalov
Kazan Federal University, Kazan, Russia
Elena Tutubalina


Cite this paper

Kunilovskaya, M., Kuzmenko, E. (2022). Multilingual Embeddings for Clustering Cultural Events. In: Burnaev, E., et al. Analysis of Images, Social Networks and Texts. AIST 2021. Lecture Notes in Computer Science, vol 13217. Springer, Cham. https://doi.org/10.1007/978-3-031-16500-9_8


DOI: https://doi.org/10.1007/978-3-031-16500-9_8
Published: 02 November 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-16499-6
Online ISBN: 978-3-031-16500-9
eBook Packages: Computer Science (R0)
