1 Introduction

Social networks have become a preferred place for propagating information and opinions and for promoting ideas [12]. Malicious accounts on social networks pose serious risks [9]. When malicious accounts are detected and blocked, their owners register new accounts, called sockpuppets, to continue spreading information. Sockpuppets usually exhibit malicious and deceptive behavior, such as fraud [11], cyberbullying [2], hate speech [6], and rumors [8]. Sockpuppet detection is therefore a valuable and challenging research issue. We broadly define a puppetmaster as an individual who manipulates more than one account.

Prior work on automatic sockpuppet detection has tended to focus on verbal [9], non-verbal [10], and network-structure [7] features. Verbal methods treat sockpuppet detection as authorship attribution [3]: they extract features that capture the stylistic, grammatical, and formatting preferences of authors on 77 groups in Wikipedia and compare the writing styles of accounts [9]. They assume that sockpuppets share linguistic preferences, such as keywords and topic titles in online discussion forums [15]. The method in [4] is based on byte-level n-grams, which are language independent. However, smart puppetmasters can disguise themselves by altering account profiles and writing styles. Non-verbal methods therefore assume that non-verbal behavior reveals the intention of puppetmasters. For example, [13] extracts 11 features from the contribution behavior of accounts and applies a community detection algorithm to detect sockpuppet groups based on an action graph and a relationship graph. However, most non-verbal features do not transfer across platforms. Existing network-structure methods rely on subjective signals such as the similarity of user views or emotions. Bu et al. [1] proposed a sockpuppet detection algorithm based on authorship-identification techniques and relationship analysis, in which a relationship between two accounts is built if they have similar attitudes and similar writing styles. Kumar et al. [5] construct the reply network of a discussion community and observe that the nodes denoting sockpuppets are more central and highly active. Some community-detection methods have also been proposed to leverage network structure to detect sockpuppets. However, these existing methods largely ignore propagation characteristics and structure.

In this work, we observe differences between the propagation trees of sockpuppets and ordinary accounts, unusual patterns that previous works ignored. A sockpuppet's propagation tree contains more identical accounts and is unexpectedly wider and deeper than that of an ordinary account. In addition, sockpuppets tend to build similar propagation trees. To exploit these observations, we construct propagation trees and extract a set of independent features from them to detect sockpuppets. To validate the effectiveness of our method, we collect two real-world datasets from Sina Weibo. The experiments demonstrate that our method outperforms previous methods.

2 Problem Formulation

Let \(G=(V,E)\) be a social network, where V is a set of accounts, \(E \subseteq V \times V\) is a set of repost relationships, and \(e_{vu}^i \in E\) denotes the repost relationship of message i between accounts v and u (\(v, u \in V\)), which reflects the propagation of information over G. We formally define the sockpuppet detection problem as follows: given a set of accounts \(U \subset V\), classify each account \(u_i \in U\) as a sockpuppet account or an ordinary account.

3 Observations

We investigate the differences between sockpuppets and ordinary accounts through two questions. (1) Difference between sockpuppet and ordinary account: how do sockpuppets and ordinary accounts differ on the dimensions of the propagation tree, including the number of identical nicknames? (2) Difference of pairwise accounts: is the propagation behavior of two sockpuppets in the same sockpuppet group more similar than that of a sockpuppet-ordinary account pair?

Difference Between Sockpuppet and Ordinary Account. As Fig. 1b and c show, sockpuppets tend to participate in the discussion of the same post more than once in order to maximize the influence of the post. Structurally, the propagation tree of a sockpuppet is deeper and wider: a message reposted by a sockpuppet spreads farther (1.86 vs 1.75) and wider (4.15 vs 3.51).

Fig. 1. (a) Sockpuppets mostly repost a message more than once, whereas ordinary accounts tend not to repost it again (2.03 vs 1.86). (b) Sockpuppets are more active than ordinary accounts (4.60 vs 3.13). (c) Sockpuppets tend to participate in hot discussions (6.09 vs 5.54).

Difference of Pairwise Accounts. Figure 2 shows that sockpuppet pairs are more similar than other pairs along three dimensions: size, depth, and width. It is reasonable that paired sockpuppets behave similarly. This indicates that it is hard for a puppetmaster to disguise their identity in propagation behavior.

Fig. 2. A sockpuppet pair (S-S) refers to two individual sockpuppets that belong to the same sockpuppet group; a sockpuppet-ordinary account pair (S-O) refers to one sockpuppet account and one ordinary account.

To sum up, we find that sockpuppets tend to repost from other sockpuppets, and that messages reposted by sockpuppets have a wider propagation range than those reposted by ordinary accounts. Paired sockpuppets tend to behave similarly to each other in order to enhance the influence of the sockpuppet group's opinion.

4 Methodology

4.1 Propagation Tree Construction

Similar to Twitter, Sina Weibo has two types of posts: original posts (tweets) and reposts (retweets). Each reposting log represents an information propagation process, such as “wow//@B:wonderful//@C:lol”. Based on the practice of referring to another account in a tweet via the “//@username” convention [14], we extract the usernames from the reposting log and construct propagation trees to represent the information propagation process of an account (Fig. 3).

Fig. 3. (a) Builds the propagation flow from the reposting log. (b) Constructs a propagation tree from the propagation flows that share the same root. We merge the propagation flow of account A, which reposts from account C. We remove propagation trees that contain only one node.
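
A minimal Python sketch of this construction, under assumptions that are not stated in the text: each hop in a reposting log is marked by “//@username:” with an ASCII colon, the last-mentioned account is taken as the root of the flow, and flows are merged by shared prefix. The names `parse_flow`, `build_trees`, and `Node` are hypothetical and chosen only for illustration.

```python
import re

# Assumed hop marker: "//@username:" (ASCII colon); real logs may need more cases.
MENTION = re.compile(r"//@([^:\s]+):")


class Node:
    """One account occurrence in a propagation tree."""

    def __init__(self, name):
        self.name = name
        self.children = {}  # child account name -> Node


def parse_flow(poster, log):
    """Return the propagation flow of one reposting log, earliest account first."""
    upstream = MENTION.findall(log)  # nearest upstream account comes first
    return list(reversed(upstream)) + [poster]


def build_trees(flows):
    """Merge flows that share a root and drop trees with only one node."""
    roots = {}
    for flow in flows:
        node = roots.setdefault(flow[0], Node(flow[0]))
        for name in flow[1:]:
            # Merging by prefix keeps repeated accounts on a path as separate nodes.
            node = node.children.setdefault(name, Node(name))
    return {name: root for name, root in roots.items() if root.children}


flows = [
    parse_flow("A", "wow//@B:wonderful//@C:lol"),  # C -> B -> A
    parse_flow("D", "nice//@C:lol"),               # C -> D
    parse_flow("E", "hello"),                      # single node, removed later
]
trees = build_trees(flows)
```

Here the flows of A and D are merged under the shared root C, and the single-node flow of E is discarded.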

4.2 Sockpuppet Account Detection

Given an account u, we construct the propagation trees of u. Our method captures propagation behavior features that fall into three types: average value, minimum value, and standard deviation. The average value of each dimension is defined as follows:

Number of posts (\({Np}_u\)): We count the size of the set of propagation trees of account u (\(D_u\)). This is a typical feature that depicts the activity frequency of accounts in a social network.

Average depth of propagation tree (\({Ad}_u\)): For this feature, we count the maximum depth \({dp}_i\) of \(d^u_i\). This represents the delay in the propagation of message i for account u: \({Ad}_u=\sum _{i=0}^{{Nd}_u}\frac{{dp}_i}{{Nd}_u}\), where \({Nd}_u\) is the size of \(D_u\).

Average size of propagation tree (\({As}_u\)): We count the total number of accounts (\({ds}_i\)) in the propagation tree of the original message i in which account u most recently participated (\(d^u_i\)). This feature captures the coverage of message i in which account u participated: \({As}_u=\sum _{i=0}^{{Nd}_u}\frac{{ds}_i}{{Nd}_u}\)

Average number of identical accounts in tree (\({Au}_u\)): The feature \({dn}_i\), the number of occurrences of the same nickname in \(d^u_i\), models the participation rate of an account in \(d^u_i\). Some accounts prefer to interact with other accounts by reposting their posts: \({Au}_u=\sum _{i=0}^{{Nd}_u}\frac{{dn}_i}{{Nd}_u}\)

Average maximum depth and width (\({Ad}_u\), \({Aw}_u\)): The maximum depth \({dd}_i\) represents one dimension of \(d^u_i\): \({Ad}_u=\sum _{i=0}^{{Nd}_u}\frac{{dd}_i}{{Nd}_u}\). The maximum width \({dw}_i\) represents another dimension of \(d^u_i\): \({Aw}_u=\sum _{i=0}^{{Nd}_u}\frac{{dw}_i}{{Nd}_u}\)

Average depth with only one 1-hop repost of the original post (\({Ah}_u\)): This feature represents the depth \({dh}_i\) of \(d^u_i\) when it has only one child: \({Ah}_u=\sum _{i=0}^{{Nd}_u}\frac{{dh}_i}{{Nd}_u}\)

Average number of children of propagation tree (\({Ac}_u\)): We consider the number of children \({dc}_i\), which represents the diversity of \(d^u_i\). Propagation trees with a single child are included: \({Ac}_u=\sum _{i=0}^{{Nd}_u}\frac{{dc}_i}{{Nd}_u}\)

Average index of type of posts (\({Pm}_u\)): Posts \(p_t\) are divided into three types, each with an index: posting (1), replying (2), and reposting (3). \({Pm}_u=\sum _{t=0}^{{Np}_u}\frac{{p_t}}{{Np}_u}\)

Average interval between interactions (\({Pi}_u\)): This is a normalized feature for which we compute the time difference between the t-th post \(p_t\) and the prior one \(p_{t-1}\). It represents how frequently account u uses the social network: \({Pi}_u=\sum _{t=1}^{{Np}_u}\frac{p_t-p_{t-1}}{{Np}_u}\).
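
The tree-based features above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the exact feature extractor: a tree is encoded as a hypothetical `(account_name, [child_subtrees])` tuple, depth is counted in edges from the root, width is the largest number of nodes on one level, and identical accounts are counted as repeated appearances of a nickname beyond its first. The function names are likewise hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical tree encoding: (account_name, [child_subtrees]).


def walk(tree, level=0):
    """Yield (name, level) for every node; the root is at level 0."""
    name, children = tree
    yield name, level
    for child in children:
        yield from walk(child, level + 1)


def tree_stats(tree):
    """Per-tree dimensions of one propagation tree."""
    nodes = list(walk(tree))
    names = Counter(name for name, _ in nodes)
    per_level = Counter(level for _, level in nodes)
    return {
        "size": len(nodes),                # total number of accounts
        "depth": max(per_level),           # deepest level, counted in edges
        "width": max(per_level.values()),  # most nodes on a single level
        "identical": sum(c - 1 for c in names.values()),  # repeated nicknames
        "children": len(tree[1]),          # direct reposts of the root
    }


def account_features(trees):
    """Average, minimum, and standard deviation of each dimension over all trees."""
    stats = [tree_stats(t) for t in trees]
    features = {"num_trees": len(stats)}
    if not stats:
        return features
    for key in stats[0]:
        values = [s[key] for s in stats]
        features["avg_" + key] = mean(values)
        features["min_" + key] = min(values)
        features["std_" + key] = pstdev(values)  # population standard deviation
    return features


t1 = ("C", [("B", [("A", []), ("C", [])]), ("D", [])])
t2 = ("X", [("Y", [])])
features = account_features([t1, t2])
```

The population standard deviation is an arbitrary choice here; the text does not say which variant is used.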

5 Experiments

5.1 Experimental Setup

Datasets. We conduct experiments on two real-world datasets, \(\mathcal {D}_{\mathcal {S}}\) and \(\mathcal {D}_{\mathcal {T}}\), built from tweets that we crawled from Sina Weibo between 2017.01 and 2018.10. Accounts are identified as sockpuppets when a self-reported sentence pattern such as “This is a sockpuppet of Mix” is matched or when other accounts identify them as being controlled by a puppetmaster. Ordinary accounts are randomly selected from the accounts that interact with sockpuppets but are not correlated with them.

Comparison Methods. We consider the following baselines for sockpuppet detection. Profile Attributes Features: The user profile is the basic information of each account, such as nickname and description. It reflects the lexical preference of the puppetmaster. We use the attributes of an account's homepage and the number of distinct login devices for the sockpuppet detection problem. Verbal Features (Verbal) [9]: This authorship-attribution approach to sockpuppet detection in Wikipedia tries to identify sockpuppet pairs by comparing writing styles. It extracts 245 verbal features from each comment of an account. Non-verbal Features (Non-verbal) [10]: This method uses several variables to represent user behavior; the variables of online non-verbal behavior fall under time-independent behavior and time-dependent behavior. For all methods, 10-fold cross-validation is performed and the average results are reported.
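
The 10-fold protocol can be sketched in plain Python as follows. This is a minimal sketch assuming a shuffled, non-stratified split; the text does not specify how folds are formed. `k_fold_indices`, `cross_validate`, and the `evaluate` callback are hypothetical names introduced for illustration.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint, near-equal folds
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validate(n_samples, evaluate, k=10, seed=0):
    """Average the score returned by evaluate(train_idx, test_idx) over k folds."""
    scores = [evaluate(tr, te) for tr, te in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)


splits = list(k_fold_indices(25, k=10))
```

In practice `evaluate` would train a classifier on the training indices and return a metric such as the F1-score on the held-out fold.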

5.2 Experimental Result and Discussion

We employ four widely used classification metrics for evaluation: precision (P), recall (R), F1-score (F1), and false positive rate (FPR). Table 1 compares several baseline methods and our proposed method across several machine learning algorithms: Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF), and Adaptive Boosting (ADA). It shows that we obtain the best F1-score using the LR algorithm on the different datasets and that the LR algorithm appears the most robust across the methods.
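
For reference, the metrics can be computed from confusion counts as in the sketch below, which uses the standard definitions. Treating the sockpuppet class as the positive label `1` and returning 0.0 for an undefined ratio are assumptions made for this example.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return precision, recall, F1-score, and false positive rate."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"P": precision, "R": recall, "F1": f1, "FPR": fpr}


# Toy labels: 1 = sockpuppet, 0 = ordinary account.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

On these toy labels there are three true positives, one false positive, one false negative, and three true negatives.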

Table 1. Sockpuppet accounts detection

Because some malicious sockpuppets are blocked, we cannot access their profiles, and some puppetmasters use diverse profile information within the same sockpuppet group; as a result, the Profile Attributes method has the worst performance. The Verbal method identifies sockpuppets through their linguistic traits, assuming that sockpuppets have unique linguistic traits; this assumption can fail because a smart account can adopt different writing styles to express its ideas. The Non-verbal method outperforms the Verbal method. A plausible explanation is that non-verbal cues are more powerful than verbal cues for characterizing accounts. Our method achieves the best performance in sockpuppet detection, which indicates that propagation features can capture the sockpuppets' intention.

6 Conclusion

We investigate the differences between sockpuppets and ordinary accounts and extract several features from the propagation tree structure to detect sockpuppets. We then evaluate the proposed method on two real-world social network datasets over two subproblems. Compared with several methods, our model shows the best performance.