Analyzing #LasTesis Feminist Movement in Twitter Using Topic Models

Rodriguez, Sebastian; Allende-Cid, Héctor; Gonzalez, Cristian; Alfaro, Rodrigo; Elortegui, Claudio; Palma, Wenceslao; Santander, Pedro

doi:10.1007/978-3-030-49570-1_44

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12194)

Included in the following conference series:

International Conference on Human-Computer Interaction

Abstract

Social networks have created a massive means of communication that was unthinkable years ago. Informal communication, blogging, and online discussions have transformed the Web into a huge repository of remarks on numerous themes, producing a potential wellspring of data for various areas. In this paper we use Topic Models to analyze a recent and widespread feminist movement. Las Tesis is a feminist collective that initiated a protest against sexual abuse, which was replicated in more than a dozen countries in a matter of days. We use LDA and BTM to automatically detect the topics in 627643 tweets gathered from 25 November 2019 to 5 January 2020. The topics obtained from tweets in Spanish and English show that these algorithms are able to capture the real-world events that occurred in Chile and Turkey.

1 Introduction

Social networks have created a means of communication that was unprecedented only a few years ago. Informal communication, blogging, and online discussions have transformed the Web into a huge archive of remarks on numerous themes, producing a potential wellspring of data for various areas. The accessibility of large-scale electronic social information from the Web and other electronic sources is changing how people communicate [14]. Social networks are also being used for other purposes, for example automatically extracting customer opinions about products or brands [15], nowcasting earthquakes [12] and detecting suicidality [13].

Twitter, one of the most widely used social networks, can be described as a social networking website built around messages of up to 280 characters. This micro-blogging service provides users with a framework for writing brief, often noisy posts about different subjects. These posts are called “tweets”. It is a blogging service because the central activity is posting short status messages (tweets) via the Web or a handheld device. Twitter is also a social network site because members have a profile page and can connect to other members by “following” them. A common feature of Twitter is retweeting: forwarding a tweet by posting it again. Reposting the same (or similar) information works because members tend to follow different sets of people, although retweeting also serves other purposes, for example helping followers to find older posts. Another feature of Twitter (and other social networks) is the hashtag: a metatag beginning with # that is intended to help others find a post, often by marking the tweet's topic or its intended audience. This feature appears to have been created by Twitter users in mid-2008 [8]. The use of hashtags emphasizes the importance of communicating information widely on Twitter. In contrast, the @ character is used to address a post to another registered Twitter user, allowing Twitter to be used effectively for conversations and collaboration.

Automatic methods are necessary to analyze and extract semantic information from the huge amount of data generated on this microblogging platform. Topic Models are a very useful tool for this purpose: they are statistical models that uncover the hidden structure in large collections of texts.

In this paper we analyze the social impact of the performance “A rapist in your path” (Un violador en tu camino) created by the feminist collective Las Tesis. Although the performance started in several cities in Chile, it has also been replicated in cities around the world, among them Paris, London, Barcelona, New York, Mexico City, Istanbul, Madrid, Berlin and Bogotá. This street art intervention greatly exceeded national borders and has brought together hundreds of women around the world, who have organized to replicate the choreography and song created by four women from Valparaíso: Daffne Valdés Vargas, Sibila Sotomayor Van Rysseghem, Paula Cometa Stange and Lea Cáceres Díaz.

The paper is organized as follows: In Sect. 2 we briefly describe Topic Models. In Sect. 3, we perform a descriptive analysis and apply two Topic Models to the data, namely LDA and BTM. In Sect. 4 we discuss the results and in the last section we conclude and outline future work.

2 Topic Models

Topic Models, put very concisely, are a specific type of statistical language model used for unveiling hidden structure in large collections of texts. Intuitively, we can think of them from several perspectives:

Dimensionality Reduction, where rather than representing a text T in its feature space, we represent it in a topic space.
Unsupervised Learning, where it can be compared to clustering. The number of topics, like the number of clusters, is a parameter that is fixed in advance. By doing topic modeling, we build clusters of words rather than clusters of texts. A text is thus a mixture of all the topics, each having a specific weight.
Tagging, where abstract “topics” that occur in a collection of documents are used to label the information that best represents them.

There are several existing algorithms for topic modeling. The most common are Latent Semantic Analysis (LSA/LSI) [4], Probabilistic Latent Semantic Analysis (pLSA) [7], and Latent Dirichlet Allocation (LDA) [2]. Topic modeling is the task of automatically identifying topics in a set of documents. This can be very useful for customer service automation, search engines and any other case where knowing the topics of documents is important. LDA [2] is a form of unsupervised learning that views documents as bags of words (where order does not matter). LDA works by first making a crucial assumption: each document was generated by selecting a set of topics and then, for each topic, selecting a set of words. To recover the topics, it does the following for each document m:

Assume there are k topics across all of the documents.
Distribute these k topics across document m (this distribution is controlled by the parameter \(\alpha \), which can be symmetric or asymmetric) by assigning each word a topic.
For each word w in document m, assume its topic is wrong but every other word is assigned the correct topic.
Probabilistically assign word w a topic based on two things:
- what topics are in document m
- how many times word w has been assigned a particular topic across all of the documents (this distribution is called \(\beta \))
Repeat this process a number of times for each document.
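The steps above correspond to what is usually called collapsed Gibbs sampling. The following is a minimal sketch of that procedure in plain Python, not the implementation used in this study. It assumes that documents are already tokenised, and it treats alpha and beta as simple smoothing constants; the function name lda_gibbs, the default values and the toy documents are all hypothetical.

```python
import random
from collections import defaultdict


def lda_gibbs(docs, k, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenised documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [[0] * k for _ in docs]                # words of doc m assigned to topic t
    topic_word = [defaultdict(int) for _ in range(k)]  # times word w is assigned to topic t
    topic_total = [0] * k                              # words assigned to topic t overall
    assignments = []

    # Start by giving every word a random topic.
    for m, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            t = rng.randrange(k)
            z_doc.append(t)
            doc_topic[m][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for m, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Assume the current topic of w is wrong: remove it from the counts.
                t_old = assignments[m][i]
                doc_topic[m][t_old] -= 1
                topic_word[t_old][w] -= 1
                topic_total[t_old] -= 1

                # Weight each topic by (topics in document m) x (topic affinity of w).
                weights = [
                    (doc_topic[m][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(k)
                ]
                t_new = rng.choices(range(k), weights=weights)[0]

                assignments[m][i] = t_new
                doc_topic[m][t_new] += 1
                topic_word[t_new][w] += 1
                topic_total[t_new] += 1

    return doc_topic, topic_word


# Hypothetical toy corpus, only to show the call.
toy_docs = [
    ["mujeres", "violador", "camino", "mujeres"],
    ["women", "protest", "police", "women"],
    ["violador", "camino", "chile"],
]
toy_doc_topic, toy_topic_word = lda_gibbs(toy_docs, k=2)
```

Each row of toy_doc_topic counts how many words of one document ended up in each topic, so dividing a row by its sum gives that document's topic mixture.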

There have been several works on Topic Models applied to Twitter, and a number of extensions to LDA have been proposed, in particular for social networks and social media. For example, in [3] the authors proposed a novel probabilistic topic model to analyze text corpora and infer descriptions of the entities and of the relationships between those entities on Wikipedia. The authors in [11] proposed a model to simultaneously discover groups among the entities and topics among the corresponding text. In [18] a model was introduced to incorporate LDA into a community detection process. Related work can also be found in [10] and [17].

Uncovering the topics within short texts, such as tweets and instant messages, has become an important task for many content analysis applications. However, directly applying conventional topic models (e.g. LDA and pLSA) to such short texts may not work well. The main reason is that traditional topic models implicitly capture document-level word co-occurrence patterns to reveal topics, and thus suffer from severe data sparsity in short documents. In [16], the authors propose a novel way of modeling topics in short texts, referred to as the biterm topic model (BTM). Specifically, in BTM the topics are learnt by directly modeling the generation of word co-occurrence patterns (i.e. biterms) in the whole corpus. The major advantages of BTM are that 1) it explicitly models the word co-occurrence patterns to enhance topic learning; and 2) it uses the aggregated patterns in the whole corpus for learning topics, which addresses the problem of sparse word co-occurrence patterns at document level. The authors carry out extensive experiments on real-world short text collections. The results demonstrate that their approach can discover more prominent and coherent topics, and significantly outperforms baseline methods on several evaluation metrics. Furthermore, they find that BTM can outperform LDA even on normal texts, showing the potential generality and wider usage of this topic model.
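To make the notion of a biterm concrete, the sketch below extracts every unordered word pair from each short text and aggregates the pairs over the whole corpus. It covers only this extraction step, not the inference of topics in BTM, and it assumes that the entire tweet is used as the co-occurrence window. The function names and the two example tweets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def biterms(tokens):
    """Return every unordered pair of words that co-occur in one short text."""
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def corpus_biterm_counts(docs):
    """Aggregate biterm counts over the whole corpus, as BTM does before inference."""
    counts = Counter()
    for tokens in docs:
        counts.update(biterms(tokens))
    return counts


# Hypothetical tokenised tweets.
example_tweets = [
    ["women", "protest", "police"],
    ["women", "police", "istanbul"],
]
biterm_counts = corpus_biterm_counts(example_tweets)
```

Here the pair ("police", "women") is counted twice because it appears in both tweets. Pooling such counts across the corpus is what lets BTM work around the sparsity of any single short document.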

3 Analysis

3.1 Descriptive Analysis

The data used in this study were collected from the micro-blogging platform Twitter. Several hashtags related to the event were used to capture 627643 tweets between 25 November 2019 and 5 January 2020. This sample was obtained with the paid Twitter API, so we retrieved all the tweets that were shared on those dates. In November the total number of tweets was 111371 and the number of unique users was 54465. In December the number of tweets was 507193, from a total of 167464 users. In January we obtained 9079 tweets from 7264 users. The total number of unique users was 202797.
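The monthly figures can be checked against the totals with a few lines of arithmetic. The sketch below uses only the numbers reported in this subsection; the variable names are illustrative.

```python
# Monthly figures as reported in the text.
monthly = {
    "2019-11": {"tweets": 111371, "users": 54465},
    "2019-12": {"tweets": 507193, "users": 167464},
    "2020-01": {"tweets": 9079, "users": 7264},
}

total_tweets = sum(m["tweets"] for m in monthly.values())
summed_monthly_users = sum(m["users"] for m in monthly.values())

reported_unique_users = 202797
# Users active in more than one month are counted once per month in the sum,
# so the excess over the reported unique total measures those repeat appearances.
excess_user_counts = summed_monthly_users - reported_unique_users
```

The monthly tweet counts add up to the reported 627643. The monthly user counts add up to 229193, which exceeds the 202797 unique users by 26396 because users who tweeted in more than one month are counted once per month.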

In Fig. 1, we can see the time series of the original messages and retweets. The time series has several peaks, reaching its maximum around the 8th of December. The highest peak in November is due to the replication of the feminist group's performance in several cities in Chile. The highest peak in December was produced after the performance of the song in Turkey, where several women were arrested by the police because of the ‘crude’ language of the song. After that, there was a peak on the 16th of December, when women politicians in Turkey replicated the performance in the parliament. All the peaks reflect events in the real world, and we can see their repercussions in this social network.

In Fig. 2 we see the number of tweets and the language distribution. The majority of the tweets were written in Spanish, Turkish, English and Portuguese. We detected a total of 32 languages across all tweets. Before the 7th of December the predominant language was Spanish, but after the performance in Istanbul, and the subsequent police violence against the demonstrators, the predominant language was Turkish. In Fig. 3 we can see the normalized graph, which shows that Spanish and Turkish were the most common languages.
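A normalized graph of this kind can be obtained by dividing each day's per-language counts by that day's total. The sketch below shows the computation on made-up numbers: the daily counts are hypothetical and are not taken from the dataset, and the function name is illustrative.

```python
def language_shares(counts):
    """Convert per-language tweet counts for one day into fractions of that day's total."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: n / total for lang, n in counts.items()}


# Hypothetical daily counts, only to illustrate the normalization.
daily_counts = {
    "2019-12-06": {"es": 800, "tr": 100, "en": 100},
    "2019-12-09": {"es": 200, "tr": 700, "en": 100},
}
daily_shares = {day: language_shares(c) for day, c in daily_counts.items()}
predominant = {day: max(s, key=s.get) for day, s in daily_shares.items()}
```

Because every day sums to one after normalization, a shift in the predominant language remains visible even on days with few tweets.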

In Fig. 4 we observe a Word Cloud of the entire dataset. The most common words were “violador” (rapist), “mujeres” (women), “camino” (path), “Chile” and “kadnlar” (from kadınların, which means women in Turkish). In Fig. 5 we observe that the most common words were “mujeres” (women), “violador” (rapist), “performance” (an English word that is also used in Spanish), and “camino” (path). In Fig. 6 we see the Word Cloud for the tweets in English. In this case the words are different and relate to the performance in Turkey. The most common words were “women”, “protest”, “turkish”, “police”, “breakup” and “last”.

3.2 Sentiment Analysis

In this subsection we analyze a sample of the data written in English. Over 60000 tweets were written in English during the entire period. The algorithm used in this part was the SentiWordNet model [5]. To perform the analysis, we first filtered the words using a POS tagger, keeping only verbs, adjectives, adverbs and nouns. Using the SentiWordNet algorithm we then obtained an estimate of the polarity of the tweets. The percentage of negative tweets was \(8\%\), while the percentage of positive ones was \(92\%\).
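The sketch below illustrates the general idea of lexicon-based polarity scoring with POS filtering. It is a stand-in and not the pipeline used here: the scores in TOY_LEXICON are invented rather than taken from SentiWordNet, the tokens are assumed to be already tagged with Penn Treebank tags, and counting a score of zero as positive is an assumption of this example.

```python
# Hypothetical (positive, negative) scores; real SentiWordNet values differ.
TOY_LEXICON = {
    ("brave", "a"): (0.75, 0.0),
    ("support", "v"): (0.5, 0.0),
    ("protest", "n"): (0.0, 0.25),
    ("arrest", "v"): (0.0, 0.5),
}

# Keep only nouns, verbs, adjectives and adverbs (Penn Treebank tag prefixes).
PENN_PREFIX_TO_POS = {"NN": "n", "VB": "v", "JJ": "a", "RB": "r"}


def polarity(tagged_tokens):
    """Sum positive minus negative scores over the content words of one tweet."""
    score = 0.0
    for word, tag in tagged_tokens:
        pos = PENN_PREFIX_TO_POS.get(tag[:2])
        if pos is None:
            continue  # determiners, prepositions, etc. are filtered out
        positive, negative = TOY_LEXICON.get((word.lower(), pos), (0.0, 0.0))
        score += positive - negative
    return score


def label(tagged_tokens):
    """Assumption of this sketch: a score of zero is counted as positive."""
    return "negative" if polarity(tagged_tokens) < 0 else "positive"


tweet_a = [("Brave", "JJ"), ("women", "NNS"), ("support", "VBP"), ("the", "DT"), ("protest", "NN")]
tweet_b = [("Police", "NNP"), ("arrest", "VBP"), ("women", "NNS")]
labels = [label(tweet_a), label(tweet_b)]
```

With these toy scores the first tweet is labelled positive and the second negative. How ties and out-of-lexicon words are handled can noticeably change the reported percentages, so any such rule should be stated explicitly.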

3.3 Topic Models

In this section we present the results obtained with LDA and BTM separately for English and Spanish, in order to analyze the resulting topics for each of these languages. The words in each topic are ordered by their probability of appearing in that topic. The two algorithms were compared in order to see which words each of them assigns to its topics. There are some noticeable differences between the two methods.
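Ordering the words of a topic by probability amounts to sorting the topic's word distribution in descending order and keeping the first few entries. In the sketch below the probabilities are hypothetical and are not results from this study; the function name top_words is illustrative.

```python
def top_words(word_probs, n=5):
    """Return the n most probable words of a topic, most probable first."""
    ranked = sorted(word_probs.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]


# Hypothetical topic-word probabilities.
example_topic = {
    "police": 0.12,
    "women": 0.30,
    "protest": 0.18,
    "turkish": 0.09,
    "song": 0.02,
}
example_top = top_words(example_topic, n=3)
```

For this made-up topic the three leading words are "women", "protest" and "police".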

Spanish. In Figs. 7 and 8 we see the results of applying both algorithms, LDA and BTM, to the Spanish tweets corpus. Since BTM is better suited to short messages, we observe that the topics obtained with it are more closely related to real-world events. The events relate to the Chilean context, since they refer to performances held in schools (“liceo”) and to the one staged by older women in front of Estadio Nacional, Santiago, Chile. One topic also refers to how this performance was replicated in other parts of the world.

English. In Figs. 9 and 10 we see the results obtained with both algorithms on the English Twitter corpus. The topics produced refer to the context in Turkey (results obtained from LDA and BTM) and France (result obtained from BTM). The events discussed in the Twitter corpus mainly refer to the violent repression of the performance in the streets of Istanbul and, afterwards, the performance held in the parliament by women politicians.

4 Discussion

As shown in the previous section, we performed several analyses of the collected data. The descriptive analysis shows that content about the phenomenon was shared and retweeted thousands of times, and that it became widespread in a matter of days. The number of languages found in the Twitter corpus shows that the performance was replicated in several countries and cities. It is also notable that most of what was said about the movement in the English tweets analyzed was positive (92%). Regarding the Topic Models, we observed that both algorithms were able to capture the real-world events that occurred in different parts of the world. In the Spanish corpus, the topics reflect the events that occurred in Chile during the first week after the emergence of the movement (performances in schools and at Estadio Nacional), while in the English corpus the topics reflect the events that occurred in Turkey, both in the streets of Istanbul and in the parliament.

5 Conclusions

In this work we analyzed over half a million tweets written in various languages. The analysis shows how widespread the performance created by the feminist collective Las Tesis became, and how it affected and influenced many feminist organizations around the world. The performance was replicated in more than 10 countries, and the song was translated into many languages. To analyze the discussion that this performance generated around the world, we used two algorithms to automatically create different topics. We used LDA and BTM, in both Spanish and English, to establish what Twitter users were talking about. We see that BTM creates more cohesive topics, which is consistent with BTM having been shown to work better on shorter texts. As future work, we intend to combine Topic Models with Sentiment Analysis to create topics for positive and negative tweets. We will also work on Machine Learning models to automatically classify tweets according to their sentiment, thus not relying on sentiment dictionaries.

References

1. Bansal, N., Koudas, N.: BlogScope: a system for online analysis of high volume text streams. In: Proceedings of the 33rd International Conference on Very Large Data Bases, pp. 1410–1413. ACM Press, New York (2007)
2. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003). https://doi.org/10.1162/jmlr.2003.3.4-5.993
3. Chang, J., Boyd-Graber, J., Blei, D.M.: Connections between the lines: augmenting social networks with text. In: KDD 2009: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 169–178 (2009)
4. Dumais, S.T.: Latent semantic analysis. Ann. Rev. Inf. Sci. Technol. 38, 188–230 (2005). https://doi.org/10.1002/aris.1440380105
5. Esuli, A., Sebastiani, F.: SentiWordNet: a publicly available lexical resource for opinion mining. In: Proceedings of 5th International Conference on Language Resources and Evaluation (LREC), Genoa, pp. 417–422 (2006)
6. Gruhl, D., Chavet, L., Gibson, D., Meyer, J., Pattanayak, P.: How to build a WebFountain: an architecture for very large-scale text analytics. IBM Syst. J. 43(1), 64–77 (2004)
7. Hofmann, T.: Probabilistic latent semantic indexing. In: Proceedings of the Twenty-Second Annual International SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1999) (1999)
8. Huang, J., Thornton, K.M., Efthimiadis, E.N.: Conversational tagging in Twitter. In: Proceedings of the 21st ACM Conference on Hypertext and Hypermedia (HT 2010), pp. 173–178. Association for Computing Machinery, New York (2010). https://doi.org/10.1145/1810617.1810647
9. Jansen, B.J., Zhang, M., Sobel, K., Chowdury, A.: Twitter power: tweets as electronic word of mouth. J. Am. Soc. Inform. Sci. Technol. 60(11), 2169–2188 (2009)
10. Liu, Y., Niculescu-Mizil, A., Gryc, W.: Topic-link LDA: joint models of topic and author community. In: ICML 2009: Proceedings of the 26th Annual International Conference on Machine Learning, pp. 665–672. ACM (2009)
11. McCallum, A., Wang, X., Mohanty, N.: Joint group and topic discovery from relations and text. In: Airoldi, E., Blei, D.M., Fienberg, S.E., Goldenberg, A., Xing, E.P., Zheng, A.X. (eds.) ICML 2006. LNCS, vol. 4503, pp. 28–44. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-73133-7_3
12. Mendoza, M., Poblete, B., Valderrama, I.: Nowcasting earthquake damages with Twitter. EPJ Data Sci. 8(1), 1–23 (2019). https://doi.org/10.1140/epjds/s13688-019-0181-0
13. O’Dea, B., Wan, S., Batterham, P.J., Calear, A.L., Paris, C., Christensen, H.: Detecting suicidality on Twitter. Internet Interv. 2(2), 183–188 (2015). ISSN 2214-7829. https://doi.org/10.1016/j.invent.2015.03.005
14. Savage, M., Burrows, R.: The coming crisis in empirical sociology. Sociology 41(5), 885–899 (2007)
15. Voorveld, H.: Brand communication in social media: a research agenda. J. Advert., 1–13 (2019). https://doi.org/10.1080/00913367.2019.1588808
16. Yan, X., Guo, J., Lan, Y., Cheng, X.: A biterm topic model for short texts. In: WWW 2013 - Proceedings of the 22nd International Conference on World Wide Web, pp. 1445–1456 (2013). https://doi.org/10.1145/2488388.2488514
17. Yu, D., Xu, D., Wang, D., Ni, Z.: Hierarchical topic modeling of Twitter data for online analytical processing. IEEE Access 7, 12373–12385 (2019). https://doi.org/10.1109/ACCESS.2019.2891902
18. Zhang, H., Giles, C.L., Foley, H.C., Yen, J.: Probabilistic community discovery using hierarchical latent Gaussian mixture model. In: AAAI 2007: Proceedings of the 22nd National Conference on Artificial Intelligence, pp. 663–668 (2007)

Author information

Authors and Affiliations

Pontificia Universidad Católica de Valparaíso, Valparaíso, Chile
Sebastian Rodriguez, Héctor Allende-Cid, Cristian Gonzalez, Rodrigo Alfaro, Claudio Elortegui, Wenceslao Palma & Pedro Santander

Corresponding author

Correspondence to Héctor Allende-Cid.

Editor information

Editors and Affiliations

Towson University, Towson, MD, USA
Gabriele Meiselwitz

About this paper

Cite this paper

Rodriguez, S. et al. (2020). Analyzing #LasTesis Feminist Movement in Twitter Using Topic Models. In: Meiselwitz, G. (eds) Social Computing and Social Media. Design, Ethics, User Behavior, and Social Network Analysis. HCII 2020. Lecture Notes in Computer Science, vol 12194. Springer, Cham. https://doi.org/10.1007/978-3-030-49570-1_44

DOI: https://doi.org/10.1007/978-3-030-49570-1_44
Published: 10 July 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-49569-5
Online ISBN: 978-3-030-49570-1
eBook Packages: Computer Science, Computer Science (R0)
