Multilingual Embeddings for Clustering Cultural Events

  • Conference paper
Analysis of Images, Social Networks and Texts (AIST 2021)

Abstract

In this paper, we describe an approach to semi-automatic text annotation based on clustering. Given a large collection of cultural event announcements from several websites, we group them by content and infer the corresponding semantic categories, which can then be used for annotation (e.g. lecture, sports, food, music). We experiment with several models for vectorising the texts, including pretrained multilingual Sentence Transformers and multilingual ELMo models. The resulting text embeddings are then clustered using K-means. We evaluate the clustering results against a stratified sample of texts with pre-existing categories (collected from the websites listing the events) and with intrinsic evaluation measures. The rationale behind this work is to produce a single categorisation covering texts from various sources and in two languages, English and Russian. The labelled collection is intended for use in a Digital Humanities project that describes cultural life in a selected location, for example by comparing types of events in Russian and British cities.
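The clustering step named in the abstract can be illustrated with a minimal K-means sketch. It assumes the text embeddings have already been computed; the three-dimensional toy vectors below are made up for illustration and merely stand in for real sentence embeddings. The function names and the deterministic farthest-first initialisation are assumptions chosen to keep the example reproducible, not the setup reported in the paper.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def farthest_first_init(vectors: List[Vector], k: int) -> List[List[float]]:
    """Deterministic initialisation (an assumption of this sketch):
    start from the first vector, then repeatedly add the vector that is
    farthest from all centroids chosen so far."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids


def kmeans(
    vectors: List[Vector], k: int, max_iter: int = 100
) -> Tuple[List[int], List[List[float]]]:
    """Plain K-means: alternate assignment and centroid update
    until the assignment stops changing or max_iter is reached."""
    centroids = farthest_first_init(vectors, k)
    labels = [-1] * len(vectors)
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = mean_vector(members)
    return labels, centroids


# Hypothetical embeddings: the first three stand for music announcements,
# the last three for lecture announcements (in either language).
toy_embeddings = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [1.0, 0.0, 0.1],
    [0.1, 0.9, 0.0],
    [0.0, 1.0, 0.1],
    [0.2, 0.8, 0.0],
]

labels, centroids = kmeans(toy_embeddings, k=2)
print(labels)
```

On real data the vectors would come from one of the embedding models mentioned above, and a library implementation of K-means with several random restarts would normally be preferred over this hand-written loop.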

This work has been partly supported by the Russian Foundation for Basic Research within Project Cultural Trends in the Tyumen Region in the National and Global Contexts No. 20-411-720010 p_a_Tyumen region.

Notes

  1. Based on the previous results, the texts from Behance were excluded: they did not contain descriptions of cultural events, but were mostly captions for graphics.

  2. https://huggingface.co/sentence-transformers/stsb-xlm-r-multilingual. This model was chosen because it was fine-tuned on the semantic textual similarity task, and we want similar texts in the two languages to be placed in the same cluster.

  3. https://github.com/ltgoslo/simple_elmo (model #219 in the NLPL repository, http://vectors.nlpl.eu/repository/).

  4. https://stats.stackexchange.com/questions/260487/adjusted-rand-index-vs-adjusted-mutual-information.

  5. Algorithm settings: stratified 5-fold cv, earlystop=True; balanced=True.
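
Note 4 points to a discussion of the Adjusted Rand Index (ARI) versus Adjusted Mutual Information. For readers unfamiliar with ARI, the sketch below computes it from the contingency counts of a gold labelling and a predicted clustering. It is an illustrative implementation under stated assumptions, not the authors' evaluation code: the function name and the toy labels are hypothetical, and degenerate inputs (for example, both partitions consisting of a single cluster) are treated here as perfect agreement.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def adjusted_rand_index(
    gold: Sequence[Hashable], predicted: Sequence[Hashable]
) -> float:
    """Adjusted Rand Index between two labellings of the same items."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    total_pairs = comb(len(gold), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: nothing to disagree on

    # Number of item pairs placed together in both labellings,
    # in the gold labelling, and in the predicted labelling.
    together_both = sum(comb(c, 2) for c in Counter(zip(gold, predicted)).values())
    together_gold = sum(comb(c, 2) for c in Counter(gold).values())
    together_pred = sum(comb(c, 2) for c in Counter(predicted).values())

    # Expected agreement under random labelling with the same cluster sizes.
    expected = together_gold * together_pred / total_pairs
    maximum = (together_gold + together_pred) / 2

    if maximum == expected:
        return 1.0  # degenerate case, treated as perfect agreement here
    return (together_both - expected) / (maximum - expected)


# Hypothetical gold categories and two candidate clusterings.
gold_labels = ["music", "music", "music", "lecture", "lecture", "lecture"]
perfect_clusters = [0, 0, 0, 1, 1, 1]
noisy_clusters = [0, 0, 1, 1, 1, 1]

print(adjusted_rand_index(gold_labels, perfect_clusters))
print(adjusted_rand_index(gold_labels, noisy_clusters))
```

Because ARI compares pairs of items rather than label names, the cluster identifiers do not need to match the category names; only the grouping matters.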

Author information

Corresponding author

Correspondence to Maria Kunilovskaya.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Kunilovskaya, M., Kuzmenko, E. (2022). Multilingual Embeddings for Clustering Cultural Events. In: Burnaev, E., et al. Analysis of Images, Social Networks and Texts. AIST 2021. Lecture Notes in Computer Science, vol 13217. Springer, Cham. https://doi.org/10.1007/978-3-031-16500-9_8

  • DOI: https://doi.org/10.1007/978-3-031-16500-9_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-16499-6

  • Online ISBN: 978-3-031-16500-9

  • eBook Packages: Computer Science, Computer Science (R0)
