Identifying Twitter Users Influence and Open Mindedness Using Anomaly Detection

Prado-Romero, Mario Alfonso; Oliva, Alberto Fernández; Hernández, Lucina García

doi:10.1007/978-3-030-01132-1_19

Mario Alfonso Prado-Romero (ORCID: orcid.org/0000-0002-0491-3515),
Alberto Fernández Oliva &
Lucina García Hernández

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 11047)

Included in the following conference series:

International Workshop on Artificial Intelligence and Pattern Recognition


Abstract

Social networks help us to connect and share our thoughts with family and friends. Businesses want to take advantage of social media to better reach their customers, but most social network users find traditional advertising annoying. As a result, the use of influencers to help a message reach its target audience has become a topic of great interest. Despite the many works in this field, detecting influence in social networks is still an open problem. In this work we propose to use anomaly detection for finding “influential” and “open minded” individuals in the Twitter network. Targeting these users can help advertisers to reach closed communities and to increase the spread of their message.


1 Introduction

Nowadays, social networks are part of our lives, helping people to stay in touch with family and friends, businesses to reach their customers and researchers to better understand society. Social networks give users the possibility to spread information and potentially reach millions of people, and they have been successfully used by public figures such as celebrities and politicians to engage their followers. For this reason, developing strategies to reach the target audience in social networks has become a critical task.

With more than 280 million monthly active users, Twitter is one of the most popular social networks today [1]. This online micro-blogging service allows users to publicly discuss any topic, from politics to everyday-life issues, using short messages called tweets. A user can follow another user to see that user's tweets and can retweet one of these tweets to share it with his or her own followers. Users can also mention other users in their tweets by adding expressions of the form @UserName, and can tag tweets with keywords called hashtags. Most Fortune 500 companies have created a Twitter account, but while many businesses have an online presence, they may not be communicating effectively with their target market, and most users are annoyed by online advertisements [2]. To be more effective, marketers look for influencers to promote their marketing campaigns, making the product promotion go viral through the social network [3].

Studying influence patterns can help us to better understand why certain trends or innovations are adopted faster than others and how we could help advertisers and marketers to design more effective campaigns [4]. It is possible to define many influence indicators on Twitter, and each one leads to a different user ranking [5]. In this work we are interested in two particular kinds of users: those whose influence goes beyond their own community and those who are more receptive to opinions coming from outside their own community. These kinds of users can help businesses to spread information more easily and to reach their target audience. Most users prefer to connect with people who have similar points of view and like similar topics; for this reason our target users can be seen as anomalies.

Anomaly detection refers to the problem of finding patterns in data that do not conform to expected behavior [6]. Techniques to detect anomalies in networks can be used to identify telecoms fraud, money laundering, and people with unusual behaviors in human groups. We propose to use anomaly detection for identifying influential and open minded users in Twitter. The main contributions of our work are:

To propose a novel algorithm to identify users' “influence” and “open mindedness”: We designed an unsupervised anomaly detection algorithm that analyzes the Twitter network and identifies users with influence in communities beyond their own and users more prone to consider external opinions.
To use only structural information of the network: Many works focus on analyzing Twitter user behavior and on sentiment analysis. Most of them analyze the content of tweets, user profiles and even information external to Twitter, whereas our proposal needs only structural information, which is useful in scenarios with stricter privacy restrictions on content.

The remainder of this paper is structured as follows: In Sect. 2, we explain how our work is related to the state of the art. In Sect. 3, our proposal is presented. In Sect. 4, we analyze the results of our algorithm on Twitter data. Finally, in Sect. 5, we present conclusions and discuss some open challenges.

2 Related Work

Influence detection on Twitter is an open task and there are many works focused on it, using different approaches and detecting different kinds of influence. Some of these studies focus on structural properties of the network. In [4] the ability of the number of followers, re-tweets and mentions to predict influence is analyzed. A new measure combining these properties is proposed in [5]. Furthermore, the problem is treated as an information propagation problem in [3]. None of these works considers the community of the users in its analysis.

Some works analyze the content of the messages to help in identifying influence. In [7] this information is used to identify five different roles among influencers. An approach based on Machine Learning techniques and Social Network Analysis is proposed in [1]. These authors also use some structural measures that consider the community of the user, but they analyze it from a global perspective. A recommender system to identify users more prone to spread information given a request from a stranger is proposed in [8].

None of the previously mentioned works analyzes users within their own community, and none takes into consideration that all the highest-ranking influencers can be members of the same community. To overcome this problem we rank influence and open mindedness by community, identifying those individuals that can help to spread the desired message among different communities. To the best of our knowledge, ours is the first work to approach the influence detection problem as an anomaly detection one.

There is much research on anomaly detection, but most of it focuses on vector data [6]. Due to the expressiveness of networks, the interest in detecting anomalies in graphs has increased [9]. Only a small number of works aim to identify graph anomalies by analyzing each element in its community [10,11,12,13], but these works focus on identifying anomalous vertices with numeric attributes and are not suitable for our problem. Our proposal is based on the InterScore algorithm [14], which was designed to discover mixing accounts in the Bitcoin network, but differs from it by taking into consideration edge direction and the sign of the dissimilarity among elements. In the next section we explain our proposal in detail.

3 Identifying Influence and Open Mindedness

Twitter is a mainly content-oriented social network, with communities evolving around topics instead of people, where retweets are the main content-based interaction [5]. For our analysis we use the retweet network, where users are vertices and there is an edge from vertex v to vertex u if user v retweeted a tweet from u. Usually, people form communities of like-minded individuals and tend to share more often those tweets that are interesting from the community perspective and to ignore other points of view. This behavior can be a problem if we want to use a reduced number of individuals to spread a message that reaches even the small communities.
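The retweet network described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the record layout (pairs of retweeting user and original author), the function name `build_retweet_graph`, and the choice to drop self-retweets are all assumptions made for the example.

```python
from collections import Counter

def build_retweet_graph(retweets):
    """Build a directed retweet graph as an edge -> count mapping.

    `retweets` is an iterable of (retweeter, original_author) pairs; this
    record layout is hypothetical. An edge (v, u) means user v retweeted
    a tweet from user u, and its count is how many times that happened.
    """
    edges = Counter()
    for retweeter, author in retweets:
        if retweeter != author:  # assumption: self-retweets are ignored
            edges[(retweeter, author)] += 1
    return edges

# Made-up sample records, for illustration only.
sample = [("ana", "bob"), ("ana", "bob"), ("carl", "bob"), ("bob", "dana")]
graph = build_retweet_graph(sample)
```

Here `graph` holds three distinct directed edges, and the edge from `ana` to `bob` carries a count of 2 because she retweeted him twice.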

Our proposal to solve this problem is based on the InterScore anomaly detection algorithm [14] but deviates from it in two major aspects. First, the direction of edges carries semantic meaning in Twitter. Users with an anomalous number of out-edges to users from other communities are more open to opinions external to their own group. On the other hand, users with an anomalous number of incoming edges from other communities are people whose opinions are interesting to people beyond the frontiers of their group. Second, a user can be anomalous by having a number of inter-community edges far greater or far lower than the rest of the members of its community, but for the problem we are targeting only the users with an anomalously high number of inter-community edges matter. To tackle these two issues we propose a new algorithm called InterScoreDS, which analyzes the inter-community links of vertices and considers edge direction and the sign of the difference in the outlierness score function. Due to the differences among user groups, our algorithm analyzes each user in its community in an unsupervised fashion. As a result, it returns an outlierness ranking of Twitter users.

Definition 1

(Outlier ranking). An outlier ranking from a graph G is a set $R = \{(v, r) | v \in V, r \in [0,1]\}$ of tuples, each one containing a vertex from G and its outlierness score.

The input of our algorithm is a user graph $G_U$ and two boolean values $in\_edges\_analysis$ and $negative\_anomalies$. These boolean values allow analysts to control the behavior of the algorithm and to better focus on the kind of outliers they want to identify, thus reducing false positives. In the first stage, the Louvain community detection method [15] is used on $G_U$ to detect groups of related users, returning a clustering C of vertices from $G_U$. Any state-of-the-art graph clustering algorithm could be used in this stage. We selected the Louvain method based on its performance and applicability to large graphs.

In the second stage, our algorithm iterates over each community $C_i \in C$ and, for each vertex, calculates the number of inter-community links it has, using a function $l:V \rightarrow \mathbb {R}$. Then, for each community $C_i$ we compute the mean difference in the number of inter-community links among its elements, as defined below:

$$\begin{aligned} IMD(C_i) = \dfrac{\sum _{v_j \in C_i} \sum _{v_k \in C_i, v_j \ne v_k} |l(v_j) - l(v_k)|}{|C_i|} \end{aligned}$$

(1)
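The second stage can be sketched as follows, under stated assumptions. The names `inter_community_links` and `imd` are hypothetical, each distinct edge is counted once regardless of weight, and the normalizer is an assumption: Eq. (1) as printed divides by $|C_i|$, which yields a value that is never smaller than the largest pairwise difference in the community, so the sketch instead divides by the number of ordered pairs, $|C_i|(|C_i|-1)$, to obtain a mean difference in the usual sense.

```python
def inter_community_links(edges, community, incoming=True):
    """Count inter-community links per vertex (the function l).

    edges: iterable of directed (v, u) pairs, meaning v retweeted u.
    community: dict mapping every vertex to its community label.
    incoming=True counts edges arriving from other communities (l_in);
    incoming=False counts edges leaving to other communities (l_out).
    Assumption: each distinct edge counts once, ignoring weights.
    """
    counts = {v: 0 for v in community}
    for v, u in edges:
        if community[v] != community[u]:
            counts[u if incoming else v] += 1
    return counts

def imd(values):
    """Mean absolute difference over ordered pairs of distinct members.

    Assumption: normalizes by n * (n - 1) rather than by n.
    """
    n = len(values)
    if n < 2:
        return 0.0
    total = sum(abs(a - b)
                for i, a in enumerate(values)
                for j, b in enumerate(values) if i != j)
    return total / (n * (n - 1))

# Made-up toy graph: three users in community 0 all retweet user "x".
edges = [("a", "x"), ("b", "x"), ("c", "x"), ("a", "b"), ("x", "y")]
community = {"a": 0, "b": 0, "c": 0, "x": 1, "y": 1}
l_in = inter_community_links(edges, community, incoming=True)
l_out = inter_community_links(edges, community, incoming=False)
imd_in_1 = imd([l_in[v] for v in community if community[v] == 1])
```

In this toy graph only `x` receives links from outside its community, so the in-link counts of community 1 are 3 and 0 and `imd_in_1` is their single pairwise difference.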

Depending on the analyst's choice, either a function $l_{in}$ that counts inter-community in-edges or a function $l_{out}$ that counts inter-community out-edges is used. Thus, we obtain two different functions, $IMD_{in}(C_i)$ and $IMD_{out}(C_i)$ respectively. Then, in the third stage, our algorithm iterates over the elements of each $C_i$ and determines the anomaly score of each element using the following function:

$$\begin{aligned} r(v, C_i) = \dfrac{\sum _{u \in C_i, u \ne v} d(v, u, C_i)}{|C_i|} \end{aligned}$$

(2)

where $d: V \times V \times 2^V \rightarrow \{0,1\}$ is a function that determines whether the difference in inter-community links between two vertices is greater than the mean difference of their community. Depending on the analyst's choice to focus on elements with an atypically large or an atypically small number of inter-community links, one of the functions defined below is used:

$$\begin{aligned} d_{high}(v,u,C_i) = \left\{ \begin{array}{ll} 0 & \text{if } |l(v) - l(u)| \le IMD(C_i) \wedge (l(v) - l(u)) < 0, \\ 1 & \text{if } |l(v) - l(u)| > IMD(C_i) \wedge (l(v) - l(u)) \ge 0. \end{array} \right. \end{aligned}$$

(3)

$$\begin{aligned} d_{low}(v,u,C_i) = \left\{ \begin{array}{ll} 0 & \text{if } |l(v) - l(u)| \le IMD(C_i) \wedge (l(v) - l(u)) \ge 0, \\ 1 & \text{if } |l(v) - l(u)| > IMD(C_i) \wedge (l(v) - l(u)) < 0. \end{array} \right. \end{aligned}$$

(4)

These score functions measure the fraction of the community with respect to which a user's number of inter-community links differs by more than the mean difference for that community. Furthermore, they take into consideration whether the user has more or fewer inter-community links than the member it is compared with. These functions adaptively rank users' outlierness according to their context, and detect anomalies that cannot be identified from a global point of view. In Algorithm 1, the steps of the InterScoreDS method can be observed in more detail.
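The scoring stage of Eqs. (2) to (4) can be illustrated with a short sketch for a single community. It is not the authors' implementation: the function name `outlier_scores`, the `high` flag and the sample values are hypothetical, the threshold is passed in as a precomputed number, and any pair that does not satisfy the "1" case is assumed to contribute 0, since the two printed cases do not cover every combination.

```python
def outlier_scores(links, imd_value, high=True):
    """Score each member of one community following Eq. (2).

    links: dict vertex -> inter-community link count l(v), one community.
    imd_value: that community's mean difference, computed beforehand.
    high=True applies d_high (atypically many links);
    high=False applies d_low (atypically few links).
    Assumption: pairs not matching the "1" case contribute 0.
    """
    n = len(links)
    scores = {}
    for v, lv in links.items():
        hits = 0
        for u, lu in links.items():
            if u == v:
                continue
            diff = lv - lu
            right_sign = diff >= 0 if high else diff < 0
            if abs(diff) > imd_value and right_sign:
                hits += 1
        scores[v] = hits / n  # Eq. (2) divides by the community size
    return scores

# Illustrative values only; 4.0 is an arbitrary threshold for the example.
community_links = {"a": 0, "b": 1, "c": 0, "d": 9}
high_scores = outlier_scores(community_links, imd_value=4.0, high=True)
low_scores = outlier_scores(community_links, imd_value=4.0, high=False)
ranking = sorted(high_scores.items(), key=lambda item: item[1], reverse=True)
```

With `high=True` only user `d` receives a non-zero score, because it is the only member whose link count exceeds the others by more than the threshold. With `high=False` the remaining three users each score against `d` instead.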

The InterScoreDS algorithm has the same $O(V^2)$ computational complexity as the InterScore algorithm, where the outlierness score function is the most expensive stage of the algorithm. Despite that, because the quadratic scoring is performed independently on each community and social networks have many communities, in real scenarios the algorithm performs better than the quadratic worst case.
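As a rough illustration with hypothetical sizes that are not taken from the experiments: if a graph with $|V| = 10^4$ vertices is split into 100 communities of 100 vertices each, the per-community scoring needs about

$$\begin{aligned} \sum _{C_i \in C} |C_i|^2 = 100 \cdot 100^2 = 10^6 \ll |V|^2 = 10^8 \end{aligned}$$

pairwise comparisons. The quadratic bound is approached only when a single community contains nearly all vertices.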

4 Experimental Results

In this section, we analyze the results of our algorithm on real data, using a set of tweets about the 2016 United States presidential elections. Because re-tweets are the most important content-oriented interaction in Twitter [5], we used the tweets from this dataset to build the re-tweet network (see Note 1). In our network, vertices represent users and there is an edge from vertex v to vertex u if user v re-tweeted a tweet from user u. Furthermore, edges have a weight indicating how many times user v has re-tweeted user u. In Table 1, some properties of the network are displayed.

Table 1. Network properties


The re-tweet network is a sparse graph with a high number of communities. The difference in size among communities is also significant, with big ones grouping most users and small ones having only a few members. We do not have a labeled dataset of influential users to use as ground truth. Because different measures lead to different perspectives about influence [4], we decided to use a Gaussian anomaly detection algorithm on the number of re-tweets as the baseline for our comparisons.
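The text does not spell out the Gaussian baseline, so the following is only one plausible reading: fit a single Gaussian to the re-tweet counts of all users and rank users by how many standard deviations they lie from the mean, in either direction. The function name and the sample counts are hypothetical.

```python
from statistics import mean, pstdev

def gaussian_outlier_scores(retweet_counts):
    """Absolute z-score of each user's re-tweet count.

    Assumption: one global Gaussian is fitted to all counts and the
    score is the two-sided distance from the mean in standard deviations.
    """
    values = list(retweet_counts.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:  # all counts identical: nobody stands out
        return {user: 0.0 for user in retweet_counts}
    return {user: abs(count - mu) / sigma
            for user, count in retweet_counts.items()}

# Made-up counts, for illustration only.
counts = {"a": 2, "b": 3, "c": 2, "d": 40}
baseline = gaussian_outlier_scores(counts)
top_user = max(baseline, key=baseline.get)
```

Unlike the per-community scores, this baseline looks at the whole network at once and ignores both community membership and the sign of the deviation.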

In Fig. 1, we compare the inter-community in-edges and the in-degree of the top 20 outliers detected using InterScoreDS and the baseline Gaussian algorithm. In general, our proposal detects users with a higher number of re-tweets and inter-community links (notice the scale difference in Fig. 1). This is because it gives analysts the option to focus only on those users with an abnormally large number of inter-community links. On the other hand, the baseline algorithm also considers users with few re-tweets.

We analyzed the top 10 outlying users identified by each algorithm and obtained interesting insights. InterScoreDS rated as influential users such as ABC News Politics, CNN Politics, Huff Post Politics, and the presidential candidate Hillary Clinton. On the other hand, the Gaussian algorithm rated as most influential users such as CNN Breaking News, The Wall Street Journal, and the presidential candidate Bernie Sanders. These differences in ranking arise because sites like CNN Breaking News and The Wall Street Journal have great influence, but in the politics domain they are consulted only by some communities, while sites like CNN Politics influence more communities on the elections topic. Furthermore, the presidential candidate Hillary Clinton ran a campaign based on diversity and targeting people from many social groups, while candidate Bernie Sanders was very influential but his influence reached only some communities.

The analysis of the top 10 open minded users identified by each algorithm is more difficult because in most cases these users are not famous or well known. The most curious finding is that InterScoreDS finds more users self-identified as liberal or progressive than the Gaussian algorithm does. Also, the Gaussian algorithm identified a user with only one tweet. These differences arise because our algorithm treats users who re-tweet tweets from other communities as more open minded, while the Gaussian algorithm only identifies people who re-tweet a lot. Also, our algorithm focused on users performing an abnormally large number of re-tweets, ignoring users with almost no re-tweets.

Experiments show how InterScoreDS can identify influencers with the power to reach many communities. The algorithm gives more options to analysts for focusing on the outliers that are really interesting for them and to overcome some problems of traditional anomaly detection methods.

5 Conclusions

We proposed inter-community links as a measure to identify influence and open mindedness in Twitter users and designed a new anomaly detection algorithm for identifying those users in an unsupervised fashion and analyzing each user in its community. Furthermore, the algorithm was tested on real data and the different results between our approach and Gaussian anomaly detection were discussed.

In future work we will address challenges such as parallelizing our algorithm to improve its performance. InterScoreDS can also be applied to other social networks and to other problems such as spammer and bot detection; these are all interesting domains for future work.

Notes

1. Data set available at https://drive.google.com/drive/folders/1f5IazToQKAIgFx1kssiKOTSYLyf7jPNV?usp=sharing.

References

1. Cossu, J.V., Dugué, N., Labatut, V.: Detecting real-world influence through Twitter. In: 2015 Second European Network Intelligence Conference (ENIC) (2015)
2. Pikas, B., Sorrentino, G.: The effectiveness of online advertising: consumer’s perceptions of ads on Facebook, Twitter and Youtube. J. Appl. Bus. Econ. 16, 70 (2014)
3. Jendoubi, S., Martin, A., Liétard, L., Hadji, H.B., Yaghlane, B.B.: Two evidential data based models for influence maximization in Twitter. Knowl.-Based Syst. 121, 58–70 (2017)
4. Cha, M., Haddadi, H., Benevenuto, F., Gummadi, P.K.: Measuring user influence in Twitter: the million follower fallacy. ICWSM 10, 30 (2010)
5. Anger, I., Kittl, C.: Measuring influence on Twitter. In: Proceedings of the 11th International Conference on Knowledge Management and Knowledge Technologies (2011)
6. Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: a survey. ACM Comput. Surv. (CSUR) 41, 15 (2009)
7. Chen, C., Gao, D., Li, W., Hou, Y.: Inferring topic-dependent influence roles of Twitter users. In: Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval (2014)
8. Lee, K., Mahmud, J., Chen, J., Zhou, M., Nichols, J.: Who will retweet this?: Automatically identifying and engaging strangers on Twitter to spread information. In: Proceedings of the 19th International Conference on Intelligent User Interfaces (2014)
9. Akoglu, L., Tong, H., Koutra, D.: Graph based anomaly detection and description: a survey. Data Min. Knowl. Discov. 29, 626–688 (2015)
10. Gao, J., Liang, F., Fan, W., Wang, C., Sun, Y., Han, J.: On community outliers and their efficient detection in information networks. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2010)
11. Müller, E., Sánchez, P.I., Mülle, Y., Böhm, K.: Ranking outlier nodes in subspaces of attributed graphs. In: 2013 IEEE 29th International Conference on Data Engineering Workshops (ICDEW) (2013)
12. Prado-Romero, M.A., Gago-Alonso, A.: Detecting contextual collective anomalies at a glance. In: Proceedings of the 23rd International Conference on Pattern Recognition (ICPR) (2016)
13. Prado-Romero, M.A., Gago-Alonso, A.: Community feature selection for anomaly detection in attributed graphs. In: Beltrán-Castañón, C., Nyström, I., Famili, F. (eds.) CIARP 2016. LNCS, vol. 10125, pp. 109–116. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-52277-7_14
14. Prado-Romero, M.A., Doerr, C., Gago-Alonso, A.: Discovering bitcoin mixing using anomaly detection. In: Mendoza, M., Velastín, S. (eds.) CIARP 2017. LNCS, vol. 10657, pp. 534–541. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-75193-1_64
15. Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. J. Stat. Mech.: Theory Exp. 2008, P10008 (2008)


Author information

Authors and Affiliations

University of Havana, Sán Lázaro and L, Vedado, 10400, Havana, Cuba
Mario Alfonso Prado-Romero, Alberto Fernández Oliva & Lucina García Hernández


Corresponding author

Correspondence to Mario Alfonso Prado-Romero.

Editor information

Editors and Affiliations

Universidad de las Ciencias Informáticas, Havana, Cuba
Yanio Hernández Heredia
Universidad de las Ciencias Informáticas, Havana, Cuba
Vladimir Milián Núñez
Universidad de las Ciencias Informáticas, Havana, Cuba
José Ruiz Shulcloper


About this paper

Cite this paper

Prado-Romero, M.A., Oliva, A.F., Hernández, L.G. (2018). Identifying Twitter Users Influence and Open Mindedness Using Anomaly Detection. In: Hernández Heredia, Y., Milián Núñez, V., Ruiz Shulcloper, J. (eds) Progress in Artificial Intelligence and Pattern Recognition. IWAIPR 2018. Lecture Notes in Computer Science, vol 11047. Springer, Cham. https://doi.org/10.1007/978-3-030-01132-1_19


DOI: https://doi.org/10.1007/978-3-030-01132-1_19
Published: 22 September 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01131-4
Online ISBN: 978-3-030-01132-1
eBook Packages: Computer Science, Computer Science (R0)
