Abstract
This paper describes our approach to semi-automatic text annotation based on clustering. Given a large collection of announcements of cultural events from several websites, we group them by content and infer semantic categories that can be used for annotation (e.g. lecture, sports, food, music). We experiment with various models for vectorising the texts, including pretrained multilingual Sentence Transformers and multilingual ELMo models. The resulting text embeddings are then clustered using K-means. We evaluate the clustering results on a stratified sample of texts with pre-existing categories (collected from the websites listing the events), as well as with intrinsic evaluation measures. The rationale behind this work is to produce a single categorisation covering texts from various sources and in two languages, English and Russian. The labelled collection of texts is intended for use in a Digital Humanities project aimed at describing cultural life in a selected location, for example, by comparing types of events in Russian and British cities.
This work has been partly supported by the Russian Foundation for Basic Research within Project Cultural Trends in the Tyumen Region in the National and Global Contexts No. 20-411-720010 p_a_Tyumen region.
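The pipeline outlined in the abstract (embed the texts, cluster the embeddings with K-means, compare clusters against pre-existing categories) can be illustrated with a minimal, dependency-free sketch. Everything concrete here is an assumption made for illustration: the event titles, the hand-made 2-D vectors standing in for real sentence embeddings, the deterministic farthest-first initialisation, and the choice of purity as the extrinsic measure. It is not the authors' implementation.

```python
import math
from collections import Counter

# Toy data (hypothetical): (title, pre-existing category, stand-in "embedding").
# Real embeddings would come from a multilingual encoder; these 2-D vectors are made up.
events = [
    ("Jazz night at the city hall", "music", (0.9, 0.1)),
    ("Вечер джаза в филармонии", "music", (1.0, 0.2)),
    ("Open-air rock concert", "music", (0.8, 0.0)),
    ("City marathon registration", "sports", (0.1, 0.9)),
    ("Городской марафон", "sports", (0.0, 1.0)),
    ("Amateur football cup", "sports", (0.2, 0.8)),
]


def farthest_first(vectors, k):
    """Deterministic initialisation: start from the first vector, then
    repeatedly add the vector farthest from all centroids chosen so far."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            max(vectors, key=lambda v: min(math.dist(v, c) for c in centroids))
        )
    return centroids


def kmeans(vectors, k, n_iter=50):
    """Plain Lloyd's K-means; returns one cluster id per input vector."""
    centroids = farthest_first(vectors, k)
    assignment = None
    for _ in range(n_iter):
        # Assignment step: each vector goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(v, centroids[c])) for v in vectors
        ]
        if new_assignment == assignment:  # converged
            break
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment


def purity(cluster_ids, gold_labels):
    """Share of items that carry the majority gold label of their cluster."""
    by_cluster = {}
    for c, g in zip(cluster_ids, gold_labels):
        by_cluster.setdefault(c, Counter())[g] += 1
    majority_total = sum(cnt.most_common(1)[0][1] for cnt in by_cluster.values())
    return majority_total / len(gold_labels)


vectors = [vec for _, _, vec in events]
gold = [label for _, label, _ in events]
clusters = kmeans(vectors, k=2)
print(clusters, purity(clusters, gold))  # prints: [0, 0, 0, 1, 1, 1] 1.0
```

In this toy setting the English and Russian titles of the same event type share a cluster only because their stand-in vectors were placed close together by hand; with real data that property would have to come from the multilingual embedding model.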
Notes
1. Based on the previous results, the texts from Behance were excluded: they did not contain descriptions of cultural events, but were mostly captions for graphics.
2. https://huggingface.co/sentence-transformers/stsb-xlm-r-multilingual. This model was chosen because it was fine-tuned on the semantic textual similarity task, and we want similar texts in the two languages to be placed in the same cluster.
3. https://github.com/ltgoslo/simple_elmo (model #219 in the NLPL repository, http://vectors.nlpl.eu/repository/).
4.
5. Algorithm settings: stratified 5-fold cv, earlystop=True; balanced=True.
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Kunilovskaya, M., Kuzmenko, E. (2022). Multilingual Embeddings for Clustering Cultural Events. In: Burnaev, E., et al. Analysis of Images, Social Networks and Texts. AIST 2021. Lecture Notes in Computer Science, vol 13217. Springer, Cham. https://doi.org/10.1007/978-3-031-16500-9_8
DOI: https://doi.org/10.1007/978-3-031-16500-9_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-16499-6
Online ISBN: 978-3-031-16500-9
eBook Packages: Computer Science (R0)