
1 Introduction

Social networks are now part of everyday life: they help people stay in touch with family and friends, businesses reach their customers, and researchers better understand society. They let users spread information to potentially millions of people and have been used successfully by public figures such as celebrities and politicians to engage their followers. For this reason, developing strategies to reach a target audience in social networks has become a critical task.

With more than 280 million monthly active users, Twitter is one of the most popular social networks today [1]. This online micro-blogging service allows users to publicly discuss any topic, from politics to everyday-life issues, using short messages called tweets. A user can follow another user to see that user's tweets and can retweet one of these tweets to share it with their own followers. Users can also mention other users in their tweets by adding expressions of the form @UserName, and can tag tweets with keywords called hashtags. Most Fortune 500 companies have created a Twitter account, but although many businesses have an online presence, they may not be communicating effectively with their target market, and most users are annoyed by online advertisements [2]. To be more effective, marketers look for influencers to promote their campaigns, making the product promotion go viral through the social network [3].

Studying influence patterns can help us better understand why certain trends or innovations are adopted faster than others and how advertisers and marketers could design more effective campaigns [4]. Many influence indicators can be defined on Twitter, and each one leads to a different user ranking [5]. In this work we are interested in two particular kinds of users: those whose influence goes beyond their own community and those who are more receptive to opinions coming from outside their own community. Such users can help businesses spread information more easily and reach their target audience. Because most users prefer to connect with people who hold similar points of view and like similar topics, our target users can be seen as anomalies.

Anomaly detection refers to the problem of finding patterns in data that do not conform to expected behavior [6]. Techniques for detecting anomalies in networks can be used to identify telecom fraud, money laundering, and people with unusual behaviors in human groups. We propose to use anomaly detection to identify influential and open-minded users on Twitter. The main contributions of our work are:

  • A novel algorithm to identify users' “influence” and “open mindedness”: We designed an unsupervised anomaly detection algorithm that analyzes the Twitter network and identifies users with influence in communities beyond their own and users more prone to consider external opinions.

  • It uses only structural information of the network: Many works focus on analyzing Twitter user behavior and on sentiment analysis. Most of them analyze the content of tweets, user profiles, and even information external to Twitter. Our proposal needs only structural information, which is useful in scenarios with stricter privacy restrictions on content.

The remainder of this paper is structured as follows. In Sect. 2, we explain how our work relates to the state of the art. In Sect. 3, we present our proposal. In Sect. 4, we analyze the results of our algorithm on Twitter data. Finally, in Sect. 5, we present conclusions and discuss some open challenges.

2 Related Work

Influence detection on Twitter is an open task, and many works address it using different approaches and detecting different kinds of influence. Some of these studies focus on structural properties of the network. In [4], the ability of the number of followers, re-tweets and mentions to predict influence is analyzed. A new measure combining these properties is proposed in [5]. Furthermore, the problem is treated as one of information propagation by [3]. None of these works considers the community of the users in its analysis.

Some works analyze the content of the messages to help identify influence. In [7], this information is used to identify five different roles among influencers. An approach based on Machine Learning techniques and Social Network Analysis is proposed by [1]. The authors also use some structural measures that consider the community of the user, but they analyze it from a global perspective. A recommender system to identify users more prone to spread information given a request from a stranger is proposed by [8].

None of the previously mentioned works analyzes users within their communities, and none takes into consideration that all of the highest-ranking influencers can be members of the same community. To overcome this problem we rank influence and open-mindedness by community, identifying those individuals who can help spread the desired message among different communities. To the best of our knowledge, ours is the first work to approach influence detection as an anomaly detection problem.

There is much research on anomaly detection, but most of it focuses on vector data [6]. Due to the expressiveness of networks, interest in detecting anomalies in graphs has increased [9]. Only a small number of works aim to identify graph anomalies by analyzing each element within its community [10,11,12,13], but these works focus on identifying anomalous vertices with numeric attributes and are not suitable for our problem. Our proposal is based on the InterScore algorithm [14], which was designed to discover mixing accounts in the Bitcoin network, but differs from it by taking into consideration edge direction and the sign of the dissimilarity among elements. In the next section we explain our proposal in detail.

3 Identifying Influence and Open Mindedness

Twitter is mainly a content-oriented social network, with communities evolving around topics instead of people, where retweets are the main content-based interaction [5]. For our analysis we use the retweet network, where users are vertices and there is an edge from vertex v to vertex u if user v retweeted a tweet from u. People usually form communities of like-minded individuals and tend to share tweets that are interesting from the community perspective while ignoring other points of view. This behavior can be a problem if we want to use a small number of individuals to spread a message that reaches even the small communities.
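As a minimal sketch of this construction, the following Python snippet builds the directed retweet graph from a list of (retweeter, author) pairs. The function name, the toy user names, the choice to keep edge multiplicities as weights, and the decision to skip self-retweets are all assumptions made for illustration, not details of our implementation.

```python
from collections import Counter

def build_retweet_graph(retweets):
    """Build a weighted directed retweet graph.

    `retweets` is an iterable of (retweeter, author) pairs. The result maps
    each directed edge (v, u) to the number of times v retweeted u.
    """
    edges = Counter()
    for retweeter, author in retweets:
        if retweeter != author:  # assumption: self-retweets are ignored
            edges[(retweeter, author)] += 1
    return edges

# Hypothetical toy data, not taken from any real dataset.
sample = [("ana", "bob"), ("ana", "bob"), ("carl", "bob"), ("bob", "dina")]
graph = build_retweet_graph(sample)
vertices = {user for edge in graph for user in edge}
```

Representing the graph as an edge-to-count mapping keeps the sketch free of third-party dependencies; any graph library with directed edges would serve equally well.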

Our proposal to solve this problem is based on the InterScore anomaly detection algorithm [14] but deviates from it in two major aspects. First, the direction of edges carries semantic meaning in Twitter. Users with an anomalous number of out-edges to users from other communities are more open to opinions from outside their own group. Conversely, users with an anomalous number of incoming edges from other communities are people whose opinions are interesting to people beyond the boundaries of their group. Second, a user can be anomalous by having a number of inter-community edges far greater or far lower than the rest of the members of its community, but for the problem we are targeting only users with an anomalously high number of inter-community edges matter. To tackle these two issues we propose a new algorithm called InterScoreDS, which analyzes the inter-community links of vertices and considers edge direction and the sign of the difference in the outlierness score function. Because user groups differ from one another, our algorithm analyzes each user within its community in an unsupervised fashion. As a result, it returns an outlierness ranking of Twitter users.

Definition 1

(Outlier ranking). An outlier ranking from a graph G is a set \(R = \{(v, r) | v \in V, r \in [0,1]\}\) of tuples, each one containing a vertex from G and its outlierness score.

The input of our algorithm is a user graph \(G_U\) and two boolean values \(in\_edges\_analysis\) and \(negative\_anomalies\). These boolean values allow analysts to control the behavior of the algorithm and to focus on the kind of outliers they want to identify, thus reducing false positives. In the first stage, the Louvain community detection method [15] is used on \(G_U\) to detect groups of related users, returning a clustering C of the vertices of \(G_U\). Any state-of-the-art graph clustering algorithm could be used in this stage. We selected the Louvain method based on its performance and applicability to large graphs.

In the second stage, our algorithm iterates over each community \(C_i \in C\) and, for each vertex, calculates the number of inter-community links it has, using a function \(l:V \rightarrow \mathbb {R}\). Then, for each community \(C_i\), we compute the mean difference in the number of inter-community links among its elements, as defined below:

$$\begin{aligned} IMD(C_i) = \dfrac{\sum _{v_j \in C_i} \sum _{v_k \in C_i, v_j \ne v_k} |l(v_j) - l(v_k)|}{|C_i|} \end{aligned}$$
(1)
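The counting function \(l\) used in the second stage can be sketched in Python as follows. The function name, the `incoming` flag, and the toy graph are hypothetical. The sketch also assumes that each distinct directed edge counts once, since the text does not say whether edge weights enter the count.

```python
def inter_community_links(edges, community_of, incoming=True):
    """Count, for every vertex, its edges to or from other communities.

    `edges` is an iterable of directed (v, u) pairs (v retweeted u) and
    `community_of` maps each vertex to a community label.
    """
    counts = {v: 0 for v in community_of}
    for v, u in edges:
        if community_of[v] != community_of[u]:
            # An edge v -> u is an in-edge of u and an out-edge of v.
            counts[u if incoming else v] += 1
    return counts

# Hypothetical toy graph with two communities (labels 0 and 1).
community_of = {"a": 0, "b": 0, "c": 0, "x": 1, "y": 1}
edges = [("x", "a"), ("y", "a"), ("b", "a"), ("c", "x"), ("y", "x")]
l_in = inter_community_links(edges, community_of, incoming=True)
l_out = inter_community_links(edges, community_of, incoming=False)
```

In this toy graph, vertex `a` receives two retweets from the other community, so it has the highest inter-community in-edge count, while the intra-community edges `b -> a` and `y -> x` are not counted at all.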

Depending on the analyst's choice, either a function \(l_{in}\) that counts inter-community in-edges or a function \(l_{out}\) that counts inter-community out-edges is used. Thus, we obtain two different functions, \(IMD_{in}(C_i)\) and \(IMD_{out}(C_i)\), respectively. Then, in the third stage, our algorithm iterates over the elements of each \(C_i\) and determines the anomaly score of each element using the following function:

$$\begin{aligned} r(v, C_i) = \dfrac{\sum _{u \in C_i, u \ne v} d(v, u, C_i)}{|C_i|} \end{aligned}$$
(2)

where \(d: V \times V \times 2^V \rightarrow \{0,1\}\) is a function that determines whether the difference in inter-community links between two vertices is greater than the mean difference of their community. Depending on the analyst's choice to focus on elements with an atypically high or low number of inter-community links, one of the functions defined below is used:

$$\begin{aligned} d_{high}(v,u,C_i) = \left\{ \begin{array}{ll} 0 & \text {if } |l(v) - l(u)| \le IMD(C_i) \wedge (l(v) - l(u)) < 0, \\ 1 & \text {if } |l(v) - l(u)| > IMD(C_i) \wedge (l(v) - l(u)) \ge 0. \end{array} \right. \end{aligned}$$
(3)
$$\begin{aligned} d_{low}(v,u,C_i) = \left\{ \begin{array}{ll} 0 & \text {if } |l(v) - l(u)| \le IMD(C_i) \wedge (l(v) - l(u)) \ge 0, \\ 1 & \text {if } |l(v) - l(u)| > IMD(C_i) \wedge (l(v) - l(u)) < 0. \end{array} \right. \end{aligned}$$
(4)

These score functions measure the fraction of the community with respect to which a user's number of inter-community links differs by more than the mean difference for that community. Furthermore, they take into account whether the user has more or fewer inter-community links than the member it is compared with. These functions adaptively rank users' outlierness according to their context and detect anomalies that cannot be identified from a global point of view. In Algorithm 1, the steps of the InterScoreDS method can be observed in more detail.

[Algorithm 1: InterScoreDS]
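The second and third stages can be sketched for a single community as follows. This is an illustrative sketch under two stated assumptions, not a reference implementation. First, the mean difference is taken over the \(|C_i|(|C_i|-1)\) ordered pairs: read literally with the double sum over ordered pairs divided by \(|C_i|\), Eq. (1) is never smaller than the range of \(l\) in the community, so no pair could exceed it. Second, pairs not covered by either branch of Eqs. (3) and (4) are scored 0. The function name and the toy counts are hypothetical.

```python
def outlier_ranking(l, high=True):
    """Score every member of one community following Eq. (2).

    `l` maps each member to its inter-community link count. With
    `high=True` the d_high rule is applied, otherwise d_low.
    """
    members = list(l)
    n = len(members)
    if n < 2:
        return [(v, 0.0) for v in members]
    # Assumption: mean absolute difference over the n * (n - 1) ordered pairs.
    total = sum(abs(l[a] - l[b]) for a in members for b in members if a != b)
    imd = total / (n * (n - 1))

    def d(v, u):
        diff = l[v] - l[u]
        if abs(diff) <= imd:
            return 0
        # Assumption: combinations not listed in Eqs. (3)-(4) score 0.
        return int(diff >= 0) if high else int(diff < 0)

    # Eq. (2): flagged comparisons divided by the community size.
    scores = {v: sum(d(v, u) for u in members if u != v) / n for v in members}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical link counts for a five-member community.
counts = {"a": 10, "b": 1, "c": 0, "d": 1, "e": 0}
ranking_high = outlier_ranking(counts, high=True)
ranking_low = outlier_ranking(counts, high=False)
```

With these counts, only `a` receives a non-zero score under the high rule, while under the low rule every other member is flagged once, for its comparison with `a`.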

The InterScoreDS algorithm has the same \(O(|V|^2)\) worst-case computational complexity as the InterScore algorithm, with the outlierness score function being the most expensive stage. However, because the quadratic scoring is performed independently on each community and social networks have many communities, in real scenarios the algorithm performs better than the quadratic worst case.
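The gap between the worst case and the typical case can be made explicit. Assuming the scoring stage costs on the order of \(|C_i|^2\) pairwise comparisons per community, and writing k for the number of communities (a symbol introduced here only for this illustration), the total cost satisfies

$$\begin{aligned} \sum _{i=1}^{k} |C_i|^2 \le \Big (\sum _{i=1}^{k} |C_i|\Big )^2 = |V|^2, \end{aligned}$$

with equality only when a single community contains every vertex. In the idealized case of k communities of equal size, the left-hand side becomes \(k\,(|V|/k)^2 = |V|^2/k\), so the cost shrinks in proportion to the number of communities.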

4 Experimental Results

In this section, we analyze the results of our algorithm on real data, using a set of tweets about the 2016 United States presidential elections. Because re-tweets are the most important content-oriented interaction in Twitter [5], we used the tweets from this dataset to build the re-tweet network. In our network, vertices represent users and there is an edge from vertex v to vertex u if user v re-tweeted a tweet from user u. Furthermore, edges have a weight indicating how many times user v has re-tweeted user u. In Table 1, some properties of the network are displayed.

Table 1. Network properties

The re-tweet network is a sparse graph with a high number of communities. The difference in size among communities is also significant, with big ones grouping most users and small ones having only a few members. We do not have a labeled dataset of influential users to use as ground truth. Because different measures lead to different perspectives on influence [4], we decided to use a Gaussian anomaly detection algorithm on the number of re-tweets as the baseline for our comparisons.
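The text does not specify how the Gaussian baseline is fitted, so the following is only one plausible reading: a single univariate Gaussian is fitted to the re-tweet counts with the sample mean and population standard deviation, and users are ranked by absolute z-score, which is equivalent to ranking by lowest density under that Gaussian. The function name and the toy counts are hypothetical.

```python
from statistics import mean, pstdev

def gaussian_outlier_ranking(values):
    """Rank items by |z-score| under a single fitted Gaussian.

    `values` maps each user to a count such as its number of re-tweets.
    Both unusually high and unusually low counts rank as anomalous.
    """
    data = list(values.values())
    mu = mean(data)
    sigma = pstdev(data)
    if sigma == 0:
        # All counts are equal, so no user stands out.
        return [(user, 0.0) for user in values]
    z = {user: abs(x - mu) / sigma for user, x in values.items()}
    return sorted(z.items(), key=lambda item: item[1], reverse=True)

# Hypothetical re-tweet counts, not taken from the dataset.
retweet_counts = {"u1": 2, "u2": 3, "u3": 2, "u4": 3, "u5": 50}
baseline_ranking = gaussian_outlier_ranking(retweet_counts)
```

Because the score is two-sided and computed over the whole network, such a baseline ignores community structure and can also flag users with unusually few re-tweets.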

In Fig. 1, we compare the inter-community in-edges and the in-degree of the top 20 outliers detected using InterScoreDS and the baseline Gaussian algorithm. It can be seen that our proposal in general detects users with a higher number of re-tweets and inter-community links (notice the scale difference in Fig. 1). This is because it gives analysts the option to focus only on those users with an abnormally large number of inter-community links. On the other hand, the baseline algorithm also considers users with few re-tweets.

Fig. 1. Comparison of in-degree and inter-community links from the top 20 outliers

We analyzed the top 10 outlying users identified by each algorithm and obtained interesting insights. InterScoreDS rated as influential users such as ABC News Politics, CNN Politics, Huff Post Politics, and the presidential candidate Hillary Clinton. On the other hand, the Gaussian algorithm rated as most influential users such as CNN Breaking News, The Wall Street Journal, and presidential candidate Bernie Sanders. These differences in ranking arise because accounts like CNN Breaking News and The Wall Street Journal have great influence, but in the politics domain they are consulted only by some communities, while accounts like CNN Politics influence more communities on the topic of the elections. Furthermore, presidential candidate Hillary Clinton ran a campaign based on diversity, targeting people from many social groups, while candidate Bernie Sanders was very influential but his influence reached only some communities.

The analysis of the top 10 open-minded users identified by each algorithm is more difficult because in most cases these users are not famous or well known. The most curious finding is that InterScoreDS finds more users who self-identify as liberal or progressive than the Gaussian algorithm does. Also, the Gaussian algorithm identified a user with only one tweet. These differences arise because our algorithm treats users who re-tweet tweets from other communities as more open-minded, while the Gaussian algorithm only identifies people who re-tweet a lot. Moreover, our algorithm focuses on users performing an abnormally large number of re-tweets, ignoring users with almost no re-tweets.

The experiments show how InterScoreDS can identify influencers with the power to reach many communities. The algorithm gives analysts more options to focus on the outliers that are of real interest to them and to overcome some problems of traditional anomaly detection methods.

5 Conclusions

We proposed inter-community links as a measure to identify influence and open-mindedness in Twitter users, and designed a new anomaly detection algorithm that identifies those users in an unsupervised fashion by analyzing each user within its community. Furthermore, the algorithm was tested on real data, and the differences between the results of our approach and those of Gaussian anomaly detection were discussed.

In future work we will focus on challenges such as parallelizing our algorithm to increase its performance. InterScoreDS can also be applied to other social networks and to other problems such as spammer and bot detection, all of which are interesting domains for future work.