1 Introduction

Social networks have created a means of communication that was unprecedented only a few years ago. Informal communication, blogging, and online discussions have transformed the Web into a huge archive of remarks on numerous themes, producing a potential source of data for various areas. The availability of large-scale electronic social data from the Web and other electronic sources is already changing how people communicate [14]. Social networks are also being used for other purposes, for example, automatically extracting customer opinions about products or brands [15], nowcasting earthquakes [12] and detecting suicidality [13].

Twitter, one of the most widely used social networks, can be described as a social networking website built around messages of up to 280 characters. This micro-blogging service provides users with a framework for writing brief, often noisy posts about different subjects. These posts are called “tweets”. It is a blogging service because the central activity is posting short status messages (tweets) via the Web or a handheld device. Twitter is also a social network site because members have a profile page and can connect to other members by “following” them. A common feature of Twitter is retweeting: forwarding a tweet by posting it again. Reposting the same (or similar) information works because members tend to follow different sets of people, although retweeting also serves other purposes, for example, helping followers to find older posts. Another feature of Twitter (and other social networks) is the hashtag: a metatag beginning with # that is intended to help others find a post, often by marking the topic of the tweet or its intended audience. This feature appears to have been invented by Twitter users in mid-2008 [8]. The use of hashtags underlines the importance of widely communicating information on Twitter. In contrast, the @ character is used to address a post to another registered Twitter user, allowing Twitter to be used effectively for conversations and collaboration.

To analyze and extract semantic information from the huge amount of data generated on this microblogging platform, automatic methods are necessary. Topic Models are a very useful tool for this purpose: they are statistically inspired methods that uncover the hidden structure in large collections of texts.

In this paper we analyze the social impact of the performance “A rapist in your path” (Un violador en tu camino) created by the feminist collective Las Tesis. Although the performance started in several cities in Chile, it has also been replicated in cities around the world, among them Paris, London, Barcelona, New York, Mexico City, Istanbul, Madrid, Berlin and Bogotá. This street art intervention went far beyond national borders and has brought together hundreds of women around the world, who have organized to replicate the choreography and song created by four women from Valparaíso: Daffne Valdés Vargas, Sibila Sotomayor Van Rysseghem, Paula Cometa Stange and Lea Cáceres Díaz.

The paper is organized as follows. In Sect. 2 we briefly describe Topic Models. In Sect. 3 we perform a descriptive analysis and apply two Topic Models to the data, namely LDA and BTM. In Sect. 4 we discuss the results, and in the last section we conclude and outline future work.

2 Topic Models

Topic Models, in brief, are a specific type of statistical language model used for uncovering hidden structure in large collections of texts. Intuitively, we can think of them from different perspectives:

  • Dimensionality Reduction, where rather than representing a text T in its feature space, you can represent it in a topic space.

  • Unsupervised Learning, where it can be compared to clustering. The number of topics, like the number of clusters, is a parameter that must be chosen in advance. By doing topic modeling, we build clusters of words rather than clusters of texts. A text is thus a mixture of all the topics, each having a specific weight.

  • Tagging, where the abstract “topics” that occur in a collection of documents are used as labels that best represent the information in them.

Several algorithms exist for topic modeling. The most common are Latent Semantic Analysis (LSA/LSI) [4], Probabilistic Latent Semantic Analysis (pLSA) [7], and Latent Dirichlet Allocation (LDA) [2]. Topic modeling is the task of automatically identifying topics in a set of documents. This can be very useful for customer service automation, search engines and any other case where knowing the topics of documents is important. LDA [2] is a form of unsupervised learning that views documents as bags of words (where order does not matter). LDA works by first making a crucial assumption: each document was generated by selecting a set of topics and then, for each topic, selecting a set of words. To recover the topics under this assumption, it does the following for each document m:

  • Assume there are k topics across all of the documents.

  • Distribute these k topics across document m (this distribution is known as \(\alpha \) and can be symmetric or asymmetric) by assigning each word a topic.

  • For each word w in document m, assume its topic is wrong but every other word is assigned the correct topic.

  • Probabilistically assign word w a topic based on two things:

    • what topics are in document m

    • how many times word w has been assigned a particular topic across all of the documents (this distribution is called \(\beta \))

  • Repeat this process a number of times for each document.
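The steps above correspond to what is usually called collapsed Gibbs sampling. The following is a minimal sketch of that procedure in plain Python, not the implementation used in this study. The function name `lda_gibbs`, the default values of `alpha` and `beta` (used here as smoothing constants for the document-topic and topic-word counts), and the number of iterations are assumptions made only for illustration.

```python
import random
from collections import Counter


def lda_gibbs(docs, k, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Illustrative collapsed Gibbs sampler for LDA on tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Count tables: topics per document, words per topic, words per topic in total.
    doc_topic = [[0] * k for _ in docs]
    topic_word = [Counter() for _ in range(k)]
    topic_total = [0] * k

    # Start by assigning every word occurrence a random topic.
    assign = []
    for m, doc in enumerate(docs):
        z_m = []
        for w in doc:
            z = rng.randrange(k)
            z_m.append(z)
            doc_topic[m][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(z_m)

    for _ in range(iters):
        for m, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Assume the current topic of w is wrong: remove it from the counts.
                z = assign[m][i]
                doc_topic[m][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1

                # Weight of topic t = (how common t is in document m)
                #                   * (how often w is assigned to t in the corpus).
                weights = [
                    (doc_topic[m][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(k)
                ]
                z = rng.choices(range(k), weights=weights)[0]

                # Record the newly sampled topic.
                assign[m][i] = z
                doc_topic[m][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    return doc_topic, topic_word
```

With the returned tables, the most probable words of topic t can be listed with `topic_word[t].most_common(10)`, which orders words by how often they were assigned to that topic.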

Several works have applied Topic Models to Twitter. LDA has been extended in several ways, and a number of these extensions target social networks and social media in particular. For example, in [3] the authors proposed a novel probabilistic topic model to analyze text corpora and infer descriptions of the entities and of relationships between those entities on Wikipedia. The authors in [11] proposed a model to simultaneously discover groups among the entities and topics among the corresponding text. In [18] a model was introduced to incorporate LDA into a community detection process. Further related work can be found in [10] and [17].

Uncovering the topics within short texts, such as tweets and instant messages, has become an important task for many content analysis applications. However, directly applying conventional topic models (e.g. LDA and pLSA) to such short texts may not work well. The main reason is that traditional topic models implicitly capture document-level word co-occurrence patterns to reveal topics, and thus suffer from severe data sparsity in short documents. In [16], the authors propose a novel way of modeling topics in short texts, referred to as the biterm topic model (BTM). Specifically, in BTM the topics are learned by directly modeling the generation of word co-occurrence patterns (i.e. biterms) in the whole corpus. The major advantages of BTM are that 1) it explicitly models word co-occurrence patterns to enhance topic learning; and 2) it uses the aggregated patterns in the whole corpus to learn topics, which addresses the problem of sparse word co-occurrence patterns at the document level. The authors carry out extensive experiments on real-world short text collections. The results demonstrate that their approach can discover more prominent and coherent topics, and that it significantly outperforms baseline methods on several evaluation metrics. Furthermore, they find that BTM can outperform LDA even on normal texts, showing the potential generality and wider usage of this new topic model.
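To make the notion of a biterm concrete, the sketch below extracts unordered word pairs from each short text and aggregates them over the whole corpus, which is the input that BTM models. It covers only this extraction step, not the topic inference itself. The function names are hypothetical, and treating each tweet as a single context window and pairing every two token positions are simplifying assumptions.

```python
from collections import Counter
from itertools import combinations


def biterms(tokens):
    """Return the unordered word pairs (biterms) of one short text."""
    # Sorting each pair makes ("women", "protest") and ("protest", "women") identical.
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def corpus_biterms(docs):
    """Aggregate biterm counts over all documents in the corpus."""
    counts = Counter()
    for doc in docs:
        counts.update(biterms(doc))
    return counts


tweets = [["women", "protest", "police"], ["police", "women"]]
print(corpus_biterms(tweets).most_common(1))
```

A three-word tweet already yields three biterms, and pairs that recur across tweets accumulate in one corpus-level table, which is how the aggregation counteracts the sparsity of individual short documents.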

Fig. 1. Number of tweets mentioning “#LasTesis” (and related words) from November 25, 2019 to January 5, 2020

3 Analysis

3.1 Descriptive Analysis

The data used in this study were collected from the micro-blogging platform Twitter. Several hashtags related to the event were used to capture 627643 tweets between the 25th of November 2019 and the 5th of January 2020. This sample was obtained with the paid Twitter API, so we obtained all of the tweets that were shared on those dates. In November the total number of tweets was 111371 and the number of unique users was 54465. In December the number of tweets was 507193, from a total of 167464 users. In January we obtained 9079 tweets from 7264 users. The total number of unique users was 202797.
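The monthly figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers reported above; the variable names and month keys are ours. It also shows why the monthly user counts add up to more than the overall number of unique users: a user who is active in more than one month is counted once in each of those months.

```python
# Monthly figures as reported in the text.
tweets = {"2019-11": 111371, "2019-12": 507193, "2020-01": 9079}
users = {"2019-11": 54465, "2019-12": 167464, "2020-01": 7264}
total_unique_users = 202797

total_tweets = sum(tweets.values())

# Share of all collected tweets that falls in each month.
tweet_share = {month: n / total_tweets for month, n in tweets.items()}

# Monthly user counts count multi-month users once per month,
# so their sum exceeds the overall number of unique users.
repeat_user_surplus = sum(users.values()) - total_unique_users

print(total_tweets)         # 627643
print(repeat_user_surplus)  # 26396
```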

In Fig. 1, we can see the time series of the original messages and retweets. The time series has several peaks, reaching its maximum around the 8th of December. The highest peak in November is due to the replication of the performance of the feminist group in several cities in Chile. The highest peak in December occurred after the performance of the song in Turkey, where several women were arrested by the police because of the ‘crude’ language of the song. After that, there was a peak on the 16th of December, when women politicians in Turkey replicated the performance in the parliament. All the peaks reflect real-world events, whose repercussions can be seen on this social network.

Fig. 2. Language distribution for the tweets from November 25, 2019 to January 5, 2020

Fig. 3. Normalized language distribution for the tweets from November 25, 2019 to January 5, 2020

Fig. 4. Word Cloud of the entire dataset

Fig. 5. Word Cloud of the Spanish tweets

Fig. 6. Word Cloud of the English tweets

In Fig. 2 we see the number of tweets and the language distribution. The majority of the tweets were written in Spanish, Turkish, English and Portuguese. We detected a total of 32 languages across all tweets. Before the 7th of December the predominant language was Spanish, but after the performance in Istanbul and the subsequent police violence against the demonstrators, the predominant language was Turkish. In Fig. 3 we show the normalized graph, where we can see that Spanish and Turkish were the most common languages.
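A normalized language distribution of this kind can be computed by dividing each daily language count by the number of tweets on that day. The sketch below illustrates the idea under the assumption that every tweet is available as a (day, language code) pair; the function name and the toy records are hypothetical and are not taken from the dataset.

```python
from collections import Counter, defaultdict


def language_shares(tweets):
    """Map each day to {language: fraction of that day's tweets}."""
    per_day = defaultdict(Counter)
    for day, lang in tweets:
        per_day[day][lang] += 1

    shares = {}
    for day, counts in per_day.items():
        day_total = sum(counts.values())
        # Dividing by the daily total makes the shares of each day sum to 1.
        shares[day] = {lang: n / day_total for lang, n in counts.items()}
    return shares


# Toy records, for illustration only.
sample = [
    ("2019-11-25", "es"),
    ("2019-12-08", "tr"),
    ("2019-12-08", "tr"),
    ("2019-12-08", "es"),
]
print(language_shares(sample))
```

Normalizing per day removes the effect of the overall volume, so a shift in the predominant language remains visible even on days with few tweets.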

In Fig. 4 we observe a Word Cloud of the entire dataset. The most common words were “violador” (rapist), “mujeres” (women), “camino” (path), “Chile” and “kadnlar” (from kadınların, which means women in Turkish). In Fig. 5, the Word Cloud of the Spanish tweets, the most common words were “mujeres” (women), “violador” (rapist), “performance” (an English word that is also used in Spanish), and “camino” (path). In Fig. 6 we see the Word Cloud for the tweets in English. In this case we find different words, related to the performance in Turkey. The most common words were “women”, “protest”, “turkish”, “police”, “breakup” and “last”.

3.2 Sentiment Analysis

In this subsection we analyze a sample of the data written in English. Over 60000 tweets were written in English during the entire period. We used the SentiWordNet model [5] to obtain an estimate of the polarity of the tweets. To perform the analysis, we first filtered the words using a POS tagger, keeping only verbs, adjectives, adverbs and nouns. The percentage of negative tweets was \(8\%\), while the percentage of positive ones was \(92\%\).
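The general shape of such a lexicon-based polarity estimate can be sketched as follows. This is an illustration under stated assumptions, not the exact pipeline used here: the tiny lexicon and its scores are made up (real SentiWordNet scores are attached to WordNet synsets), the POS tags are assumed to be given, and labeling a zero score as positive is an arbitrary choice made only so that every tweet receives one of the two labels.

```python
# Hypothetical toy lexicon: (word, POS) -> (positive score, negative score).
# The values are invented for illustration and are not SentiWordNet scores.
TOY_LEXICON = {
    ("brave", "ADJ"): (0.75, 0.0),
    ("support", "VERB"): (0.5, 0.0),
    ("violence", "NOUN"): (0.0, 0.625),
    ("arrest", "VERB"): (0.0, 0.5),
}

# Only content words are scored, mirroring the POS filter described above.
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}


def polarity(tagged_tokens, lexicon=TOY_LEXICON):
    """Sum (positive - negative) scores over the content words of a tweet."""
    score = 0.0
    for word, pos in tagged_tokens:
        if pos not in CONTENT_POS:
            continue
        positive, negative = lexicon.get((word.lower(), pos), (0.0, 0.0))
        score += positive - negative
    return score


def label(tagged_tokens):
    """Label a tweet; a score of exactly zero counts as positive (assumption)."""
    return "negative" if polarity(tagged_tokens) < 0 else "positive"


tweet = [("Brave", "ADJ"), ("women", "NOUN"), ("against", "ADP"), ("violence", "NOUN")]
print(polarity(tweet), label(tweet))
```

Because words missing from the lexicon contribute nothing, the resulting percentages depend strongly on lexicon coverage and on how zero scores are handled.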

3.3 Topic Models

In this section we present the results obtained with LDA and BTM separately for English and Spanish, in order to analyze the resulting topics for each of these languages. The words in each topic are ordered by their probability of appearance in that topic. We compare the two algorithms to see which words each one assigns to its topics; the results of the two methods show some noticeable differences.

Spanish. In Figs. 7 and 8 we see the results of applying both algorithms, LDA and BTM, to the Spanish tweet corpus. Since BTM is better suited to short messages, we observe that the topics obtained with it are more closely related to real-world events. The events are related to the Chilean context, since they refer to performances held in schools (“liceo”) and the one held by older women in front of Estadio Nacional, Santiago, Chile. In one topic there is also a reference to how this performance was replicated in other parts of the world.

Fig. 7. Five topics obtained with LDA for Spanish Tweets. The words are ordered by their probability of appearance in a given topic.

Fig. 8. Five topics obtained with Bi-term Topic Model for Spanish Tweets. The words are ordered by their probability of appearance in a given topic.

English. In Figs. 9 and 10 we see the results obtained with both algorithms on the English Twitter corpus. The topics produced refer to the context in Turkey (results obtained from LDA and BTM) and France (result obtained from BTM). The events discussed in the Twitter corpus mainly refer to the violent repression of the performance in the streets of Istanbul and, afterwards, the performance held in the parliament by women politicians.

4 Discussion

As seen in the previous section, we performed several analyses of the collected data. The descriptive analysis shows that the phenomenon was shared and retweeted thousands of times, and that it became widespread in a matter of days. The total number of languages that we found in the Twitter corpus indicates that the performance was also replicated in several countries and cities. It is also notable that most of what was said about the movement was positive (92% of the analyzed English tweets). Regarding the results obtained with the two Topic Models, we observed that both algorithms were able to capture the real-world events that occurred in different parts of the world. In the Spanish corpus, we recovered the events that occurred in Chile during the first week after the emergence of the movement (performances in schools and in front of Estadio Nacional), while in the English corpus we recovered the events that occurred in Turkey, both in the streets of Istanbul and in the parliament.

Fig. 9. Five topics obtained with LDA for English Tweets. The words are ordered by their probability of appearance in a given topic.

Fig. 10. Five topics obtained with Bi-term Topic Model for English Tweets. The words are ordered by their probability of appearance in a given topic.

5 Conclusions

In this work we analyzed over half a million tweets written in various languages. The analysis shows how widespread the performance created by the feminist collective Las Tesis became, and how it affected and influenced many feminist organizations around the world. The performance was replicated in over 10 countries, and the song was translated into many languages. To analyze the discussion that this performance generated around the world, we used two algorithms to automatically create different topics. We used LDA and BTM, in both Spanish and English, to establish what Twitter users were talking about. We see that BTM creates more cohesive topics, which agrees with BTM having been shown to work better on shorter texts. As future work, we intend to combine the topic models with Sentiment Analysis to create topics for positive and negative tweets. We will also work on Machine Learning models to automatically classify those tweets according to their sentiment, thus not relying on sentiment dictionaries.