1 Introduction

Bots are prevalent on social media and their malicious actions have been observed repeatedly. A prominent example of this widespread bot activity was the 2016 US Presidential Election. Allcott and Gentzkow [1] studied this event extensively and reported that millions of pro-Trump and pro-Clinton fake stories were shared on Facebook, in part by bot accounts. They also provide evidence that more than 40% of fake news sources use social media to spread their content.

Thus, researchers have put great effort into understanding bots and developing methods to detect them. In supervised bot detection methods, which are the focus of this work, a labeled dataset of bots and human users is available prior to training a machine learning classifier. Using these labels, we can learn characteristics, also known as features (described in detail in Sect. 2), that discriminate bots from humans and use them to build classifiers that predict class labels (bot or human). The classifiers are then tested on unobserved datasets and evaluated using two prominent metrics: precision and recall.

A common theme among previous bot detection methods is attempting to maximize precision [4, 9, 16]. This is one extreme: the sole purpose is to minimize false positives and avoid mistakenly marking a human user as a bot. By doing this, detection methods avoid removing human users from the site but leave many bots undetected. The other extreme is eliminating bots from social media at the price of removing human users, which is not preferable either. A method for finding a trade-off between precision and recall is optimizing for the \(F_1\) score, the harmonic mean of precision and recall. The harmonic mean is dominated by the minimum of its arguments, so \(F_1\) cannot become arbitrarily large by increasing one metric while the other stays low. This prevents bot detection algorithms from landing on trivial solutions (marking all users as bots or as humans) to gain a high \(F_1\). However, because \(F_1\) weights precision and recall equally, it gives us no control over the final value of either metric: two classifiers are considered equally good if they have the same \(F_1\), even if one yields higher recall and the other higher precision. The ideal case is finding a solution close to the optimum \(F_1\) that allows us to focus on precision or recall depending on the application.
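As a brief illustration of why the harmonic mean blocks trivial solutions, the following minimal sketch (with made-up precision and recall values, not results from our datasets) shows that marking every user as a bot does not buy a high \(F_1\):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Hypothetical dataset where only 10% of users are bots.
# Marking everyone as a bot gives recall = 1.0 but precision = 0.1,
# so F1 stays close to the smaller of the two values.
print(f1(0.1, 1.0))   # ~0.18
print(f1(0.6, 0.6))   # 0.60: a balanced classifier scores far higher
```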

Fig. 1. Our goal is a recall-focused approach close to the optimal \(F_1\).

To align with corporate goals (maintaining a large number of active users and retaining human users by avoiding accidental account suspensions), bot detection models with high precision are preferable. From a user's perspective, however, whether that of an ordinary social media user or of a researcher, the preferable situation is encountering as few bots as possible, so high recall is preferred. In this work, we develop a supervised algorithm aligned with the user's perspective: a REcall-FOCUSed bot detection model, REFOCUS. We use multiple real-world datasets to show how we can find a sweet spot between blindly optimizing for \(F_1\) and blindly optimizing for recall, as shown in Fig. 1. We also compare REFOCUS with state-of-the-art bot detection models to show that focusing on recall does not necessarily deteriorate overall performance in terms of \(F_1\).

2 Supervised Bot Detection Methods

To use supervised bot detection models, one must identify differences between bot and human users in terms of features such as content or activity in a labeled dataset. Then, a classifier is trained on the features and labels to distinguish bots from humans in an unobserved dataset. Different classification methods can be used for this purpose, such as Support Vector Machines [11], Random Forests [9], and Neural Networks [8]. We describe some common user features below (a minimal computational sketch follows the list):

  • Content: the measures in this category focus on the content shared by users. Words, phrases [16], and topics [11] of social media posts can be a strong indicator of bot activity. Also, bots are motivated to persuade real users into visiting external sites operated by their controller and hence share more URLs than human users do [4, 13, 17]. Bots have also been observed to lack originality in their tweets and to have a large retweet-to-tweet ratio [14].

  • Activity Patterns: bots tweet in a “bursty” manner [4, 10], publishing many tweets in a short time and then being inactive for a longer period. Bots also tend to have either very regular (e.g. tweeting every 10 min) or highly irregular (randomized lapses) tweeting patterns over time [18].

  • Network Connections: bots connect to a large number of users hoping to receive follows back, but the majority of human users do not reciprocate. Hence, bots tend to follow more users than follow them back [4].
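As a concrete illustration of the content and network features above, the following minimal sketch computes a few of them from a hypothetical per-user record; the field names (`tweets`, `is_retweet`, `urls`, `followers`, `friends`) are assumptions for illustration, not the schema of any particular dataset:

```python
from statistics import mean

def simple_bot_features(user):
    """Toy content / network features for one user.

    `user` is a dict with hypothetical fields:
      tweets:    list of {"text": str, "is_retweet": bool, "urls": int}
      followers: number of accounts following the user
      friends:   number of accounts the user follows
    """
    tweets = user["tweets"]
    n = max(len(tweets), 1)
    return {
        # Content: how often the user shares URLs, and how original their posts are.
        "url_ratio": sum(t["urls"] > 0 for t in tweets) / n,
        "retweet_ratio": sum(t["is_retweet"] for t in tweets) / n,
        "avg_tweet_length": mean(len(t["text"]) for t in tweets) if tweets else 0.0,
        # Network: bots tend to follow far more accounts than follow them back.
        "friends_to_followers": user["friends"] / max(user["followers"], 1),
    }
```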

As bots become more complex and harder to detect [5], bot detection methods incorporate a larger number of features and datasets with more samples of humans and bots. One of the recent approaches is BotOrNot [6]. This method exploits 1,150 features from six categories: user-based, friends, network, temporal, language, and sentiment [16]. The initial model was trained on a dataset of \(\sim \)40,000 bots and humans and has been updated multiple times using seven more datasets with a total of \(\sim \)87,000 samples. The strength of this method has encouraged researchers to use it to label ground-truth datasets [7]. We will compare our proposed method with BotOrNot in Sect. 5.

3 Data for Supervised Bot Detection

To show the robustness of our model with respect to language, topic, time, and labeling mechanism, we use the three datasets presented in Table 1.

Table 1. Statistics of the datasets used in this study.

As seen in Table 1, we utilize three existing bot detection datasets in this work. We describe how each of these raw datasets was collected and what it contains in Sect. 3.1. Then, we specify how we preprocessed the raw data using a content-based feature extraction method in Sect. 3.2.

3.1 Datasets

The first dataset is a honeypot dataset collected by Morstatter et al. [11], which we refer to as the Arabic Honeypot dataset. It was collected using a network of 9 honeypot accounts that tweeted Arabic phrases and randomly followed and retweeted each other. Any user who followed a honeypot was considered a bot, because the honeypots' behavior is sporadic and provides no meaningful information to a human. To collect a set of human users, the authors manually inspected users who tweeted the same Arabic phrases as some of the bots, then crawled data for them and for the users they immediately followed, under the assumption that humans only follow other humans and not bots. In August 2018, we re-crawled this dataset using the tweet IDs shared by the authors.

Additionally, we employ two datasets introduced by Cresci et al. [5] in their previous work on detecting social bots on Twitter, namely test set #1 and test set #2, which we call Social Spambots 1 and 2, respectively. Each dataset is a combination of social spambots and human users on Twitter. To collect the human accounts, Cresci et al. contacted random users, asked a natural-language question, and manually verified that each respondent was a human. Social Spambots 1 contains these genuine accounts plus social bots that were discovered during the 2014 Mayoral election in Rome, Italy, and that were used to retweet a candidate within minutes of his original posting. Social Spambots 2 includes the genuine accounts and social bots that advertised products on Amazon.com by deceitfully spamming URLs pointing to the products. We obtained these datasets directly from the BotOrNot Bot Repository [6].

3.2 Feature Extraction

Bot accounts are created by malicious actors to serve specific purposes. Thus, their content can be a strong indicator for exposing such potentially automated accounts. The problem with using content for bot detection is that raw text features are high-dimensional and sparse. Inspired by recent advances in topic modeling, we adopt latent Dirichlet allocation (LDA) [3] to obtain a topic representation of each user. LDA, which treats each document as a distribution over topics and each topic as a distribution over the vocabulary of the dataset, has proven useful for extracting the latent semantics of documents. As such, we use LDA to extract features from users' tweet content. In this work, each user is treated as one document whose content is the concatenation of their tweets. We train a separate LDA model on each of the three datasets and develop the classifiers independently. We follow the assumption that, since bots are naturally more interested in certain topics, representing each user as a distribution over topics may help distinguish them from regular accounts [11].
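A minimal sketch of this per-user topic representation with scikit-learn is shown below. The toy `user_tweets` mapping, the `max_features` cap, and the use of `CountVectorizer` are our own assumptions for illustration; the 200-topic setting follows the choice reported in Sect. 5.2.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical input: a mapping from user ID to that user's tweets.
user_tweets = {
    "u1": ["check out this deal http://spam.example", "buy now http://spam.example"],
    "u2": ["had a great day at the park", "reading a new book tonight"],
}

# One "document" per user: the concatenation of that user's tweets.
user_ids = list(user_tweets)
documents = [" ".join(user_tweets[u]) for u in user_ids]

# Bag-of-words counts, then an LDA topic model trained on this dataset only.
counts = CountVectorizer(max_features=5000).fit_transform(documents)
lda = LatentDirichletAllocation(n_components=200, random_state=0)

# Each row is one user's distribution over the 200 topics,
# which serves as that user's feature vector for the classifier.
topic_features = lda.fit_transform(counts)
```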

4 A Recall-Focused Approach – REFOCUS

A bot detection classifier generates, for each instance in the dataset, the probability of being a bot (belonging to the positive class). To assign a binary label to users, a classifier uses a threshold (commonly set to 0.5 [12]): if the probability of being a bot exceeds the classification threshold, the user is labeled as a bot, otherwise as a human. Based on the assigned labels, classifiers can be evaluated using precision (P) and recall (R), illustrated in Fig. 2 and defined below:

$$\begin{aligned} P=\frac{tp}{tp+fp}, \quad R=\frac{tp}{tp+fn} \end{aligned}$$
(1)

Precision and recall can each be trivially maximized in isolation. A trivial approach for increasing recall is lowering the classification threshold and classifying more users as bots. Conversely, increasing the classification threshold labels most users as humans, with only the unquestionably obvious bots labeled as bots, which trivially increases precision. However, precision and recall are not independent of each other: increasing one may decrease the other. One approach for finding a trade-off between precision and recall is the \(F_\beta \) score, the weighted harmonic mean of precision and recall [15], defined as follows:

$$\begin{aligned} F_\beta = \frac{(1+\beta ^2)PR}{\beta ^2 P + R} \end{aligned}$$
(2)

For \(\beta > 1\), recall is considered \(\beta \) times as important as precision; for \(\beta < 1\), precision is weighted more heavily.
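A minimal sketch of Eq. (2) is given below; the precision and recall values are made up purely to show how the weighting behaves as \(\beta \) changes:

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall (Eq. 2)."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Made-up scores: a recall-heavy classifier vs. a balanced one.
for beta in (0.5, 1, 2):
    print(beta, round(f_beta(0.6, 0.9, beta), 3), round(f_beta(0.75, 0.75, beta), 3))
# As beta grows, the recall-heavy classifier overtakes the balanced one.
```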

Fig. 2. Illustration of true negatives (b): tn, false positives (c): fp, true positives (d): tp, and false negatives (e): fn for a classifier trained on dataset (a), when the classifier labels a subset of users as bots (positive class), \(b_{cl}\), and the rest as humans (negative class), \(h_{cl}\).

4.1 Searching for a Trade-Off: Selecting \(\beta \)

Our goal is optimizing for recall; hence, we utilize \(F_\beta \) with \(\beta > 1\) to find the best classification threshold: a sweet spot between the threshold at which \(F_1\) (overall performance) is maximized and the threshold at which \(R=1\). The framework of our recall-focused approach is presented in Fig. 3. We divide the dataset into 90% Train and 10% Test. Then, for ten iterations, we split Train into 90% \(Train_i\) and 10% \(Val_i\) (training and validation sets, respectively), so \(Train_i\) is 81% of the whole data and \(Val_i\) is 9%. In each iteration, we train a classifier \(C_i\) on \(Train_i\), vary the classification threshold between 0.1 and 0.9 in steps of 0.1, and find the threshold that yields the highest \(F_\beta \) score on \(Val_i\); we call this threshold \(t_i\). After the tenth iteration, we average the thresholds \(t_1\) to \(t_{10}\) to obtain the average threshold t. We then train a classifier, C, on Train and, using t, compute precision, recall, and the \(F_1\) score on Test. We repeat this whole process ten times and report the average precision, recall, and \(F_1\) scores as our final results.

Fig. 3. Framework for the proposed bot detection model, REFOCUS.
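The threshold-selection loop described above can be sketched with scikit-learn as follows. This is a minimal illustration, assuming `X_train` is the LDA topic matrix and `y_train` the bot/human labels; using `SVC(probability=True)` to obtain probability estimates is our own implementation choice, not a detail specified in the framework.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import fbeta_score

def refocus_threshold(X_train, y_train, beta=2, n_splits=10):
    """Average the per-split thresholds that maximize F_beta on validation data."""
    grid = np.arange(0.1, 1.0, 0.1)
    thresholds = []
    for i in range(n_splits):
        X_tr, X_val, y_tr, y_val = train_test_split(
            X_train, y_train, test_size=0.1, random_state=i)
        clf = SVC(probability=True).fit(X_tr, y_tr)        # classifier C_i
        probs = clf.predict_proba(X_val)[:, 1]             # P(bot) on Val_i
        scores = [fbeta_score(y_val, probs >= t, beta=beta) for t in grid]
        thresholds.append(grid[int(np.argmax(scores))])    # threshold t_i
    return float(np.mean(thresholds))                      # average threshold t

# Usage sketch: learn t on Train, then label Test with it.
# t = refocus_threshold(X_train, y_train, beta=2)
# final_clf = SVC(probability=True).fit(X_train, y_train)
# y_pred = final_clf.predict_proba(X_test)[:, 1] >= t
```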

We need to test different values of \(\beta \) in the training phase to find the best classification threshold using \(F_\beta \). As we increase \(\beta \), precision follows a non-increasing trend and recall a non-decreasing trend, because a larger \(\beta \) puts more weight on recall relative to precision. More formally,

$$\begin{aligned} R_{\beta _i} \ge R_{\beta _j} \quad \text {and} \quad P_{\beta _i} \le P_{\beta _j} \qquad \text {if } \beta _i > \beta _j \end{aligned}$$
(3)

Because precision is non-increasing in \(\beta \), we prefer to keep \(\beta \) low unless a larger \(\beta \) offers a substantial gain in recall for only a minor loss in precision. To find the right \(\beta \), we start from \(\beta =1\) and, at each step, accept the current \(\beta \) as \(\beta _{opt}\) if

$$\begin{aligned} (R^\beta -R^{\beta _{opt}}) > (F_1^{\beta _{opt}}-F_1^\beta ) \end{aligned}$$
(4)

That is, we choose a larger \(\beta \) only if the gain in recall exceeds the loss in \(F_1\).
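The selection rule in Eq. (4) can be sketched as a small loop over candidate \(\beta \) values. Here `evaluate(beta)` is a hypothetical helper, assumed to run the threshold-selection procedure above and return the resulting recall and \(F_1\) on validation data:

```python
def select_beta(evaluate, candidates=(1, 2, 3, 4, 5)):
    """Pick the largest beta whose recall gain outweighs its F1 loss (Eq. 4).

    `evaluate(beta)` is assumed to return (recall, f1) for a model whose
    classification threshold was tuned with F_beta.
    """
    beta_opt = candidates[0]
    r_opt, f1_opt = evaluate(beta_opt)
    for beta in candidates[1:]:
        r, f1 = evaluate(beta)
        if (r - r_opt) > (f1_opt - f1):   # gain in R exceeds loss in F1
            beta_opt, r_opt, f1_opt = beta, r, f1
    return beta_opt
```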

5 Experiments

In this section we empirically investigate the performance of our proposed approach. First, we investigate the effect of \(\beta \) and then, we compare REFOCUS with baseline bot detection models in terms of P, R, and \(F_1\).

Fig. 4. Effect of \(\beta \) on precision (P), recall (R), and overall performance (\(F_1\)). For each dataset, we vary \(\beta \) from 1 to 5, use \(F_\beta \) to find the best classification threshold in the training phase, and report P, R, and \(F_1\) on the test set.

5.1 Searching for the Right \(\beta \)

It is intuitive that using an \(F_\beta \) with \(\beta > 1\) for training a classifier helps us find a classification threshold that yields higher recall than \(\beta =1\) does. However, this raises two questions: (1) what is the best value of \(\beta \), and can we increase it indefinitely to reach the highest possible recall? (2) Does a model trained with \(\beta >1\) still perform well in terms of \(F_1\), or do we drastically lose precision? We answer the first question here and the second in Sect. 5.2.

We test our model on three datasets: Arabic Honeypot, Social Spambots 1, and Social Spambots 2. The results are shown in Fig. 4. In Social Spambots 1, we do not observe any change in overall performance in terms of \(F_1\) as we vary \(\beta \). This can be due to the way this dataset was collected, which results in humans and bots being quite distinct from each other. This distinction lets the classifier perform well no matter what the threshold is; hence, any of the \(F_\beta \) scores can be used to find the best classification threshold. In Arabic Honeypot and Social Spambots 2, we see some variation in precision, recall, and overall performance. \(\beta =2\) gives us the best trade-off between precision and recall because the loss in overall performance is smaller than the gain in recall; in other words, the slope of the recall curve is larger than the slope of the \(F_1\) curve. Further increasing \(\beta \) does not provide enough gain in recall compared to the loss in overall performance, so we stop at \(F_2\).

5.2 Testing the Overall Performance

To compare the overall performance of REFOCUS with other bot detection methods, we need to decide on the number of LDA topics and the classification model. Due to the similarity between our feature extraction and that of Morstatter et al. [11], we follow their observation that 200 topics generated the highest \(F_1\) on the Arabic Honeypot dataset and set the number of topics to 200.

We test multiple classification algorithms that have been observed to perform well for bot detection [2] to find the best fit for REFOCUS: Decision Tree, Random Forest, Logistic Regression, and SVM.

Table 2. Performance of REFOCUS when implemented using different classifiers.

We use the Python Scikit-learn package [12] for the implementation with default settings, except \(max\_depth=1\) for Random Forest and \(max\_iter=1\) for Logistic Regression to avoid overfitting. As shown in Table 2, all classifiers achieve very similar \(F_1\) scores (differences of less than 0.5%), except for Random Forest, which performs worse. We choose SVM for the rest of our experiments because it yields similar or higher R and a similar \(F_1\). It is worth noting that our method can be built on top of any classifier to help improve recall without sacrificing overall performance.
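The candidate classifiers can be instantiated roughly as follows; apart from the two settings mentioned above, everything is left at the scikit-learn defaults, and `probability=True` for the SVM is our own assumption so that a class probability is available for thresholding:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

candidate_classifiers = {
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(max_depth=1),   # shallow trees to limit overfitting
    "Logistic Regression": LogisticRegression(max_iter=1),  # single iteration, as stated above
    "SVM": SVC(probability=True),                           # probability estimates for thresholding
}
```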

We compare our proposed approach, REFOCUS, with two baselines:

  • SVM: REFOCUS uses SVM to train multiple classifiers on subsamples of the dataset and learns the best recall-precision trade-off using \(F_\beta \). Hence, we compare our method against an SVM with its parameters set to the defaults, which generates class labels (1 or −1) using 0.5 as the threshold. Users are represented with 200 LDA topics and we use 10-fold cross-validation.

  • BotOrNot [6, 16]: this supervised bot detection model exploits 1,150 features in six categories: user-based, friends, network, temporal, content and language, and sentiment. The model uses a Random Forest classifier and is trained on multiple publicly available datasets. Owing to its performance, BotOrNot has been used for generating ground truth.

We perform two sets of experiments. In the first, we use the Arabic Honeypot dataset: we use an LDA model with 200 topics to extract features, then apply REFOCUS and report the results. However, using this dataset alone raises the concern that our approach might not perform as well on non-Arabic tweets; hence, we also perform a second experiment. We follow the same procedure but use the datasets collected by Cresci et al. [5]. These datasets (described in Sect. 3) have three advantages: they are among the most recent publicly available labeled bot datasets and include newer bots, they were labeled manually, which differs from the honeypot mechanism, and the majority of their tweets are in English. Hence, by testing our approach on Cresci's datasets, we show that our model performs well regardless of the language of the tweets and is resilient to new bots that emerge on social media.

The results are presented in Table 3. For the experiments on Cresci's datasets, we do not balance the classes due to the small size of the data; hence, we also include the ROC AUC in our results. The ROC AUC of a classifier that randomly assigns labels to instances is 0.5 regardless of class balance, which makes it a helpful metric when the samples of one class outnumber the other. Preserving the class imbalance also helps mimic the real-world scenario, where bots are a small portion of all users on social media [16].
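ROC AUC is computed from the predicted bot probabilities rather than from thresholded labels; a minimal sketch with scikit-learn and made-up toy values:

```python
from sklearn.metrics import roc_auc_score

# Toy labels (1 = bot, 0 = human) and predicted bot probabilities.
y_true = [1, 0, 0, 1, 0, 0, 0, 1]
bot_probs = [0.9, 0.2, 0.4, 0.7, 0.1, 0.75, 0.6, 0.8]

# Threshold-independent score; a random classifier stays near 0.5
# regardless of how imbalanced the classes are.
print(roc_auc_score(y_true, bot_probs))
```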

Table 3. Comparison between REFOCUS and baseline bot detection methods.

In the first experiment, on the Arabic Honeypot dataset, SVM has higher precision and lower recall than REFOCUS. The reason is that SVM only labels a user as a bot if the predicted probability of being a bot exceeds 0.5, whereas our method learns the threshold that optimizes recall while reaching a high \(F_1\); here REFOCUS chooses a lower threshold (0.35). This choice results in a 2% lower \(F_1\), a loss we are willing to tolerate for a 6% gain in recall. BotOrNot performs considerably worse on this dataset than on the Social Spambot datasets. The reason is that the Social Spambot datasets were used in training BotOrNot, and classifiers are expected to perform worse on unseen datasets (e.g. Arabic Honeypot).

In the second experiment, we test our method on two non-Arabic datasets that were labeled manually, to show that our results are robust to variations in datasets such as language. In Social Spambots 1, SVM and our proposed approach perform almost identically, with slightly better recall for REFOCUS. The reason is that the differences between human and bot instances are captured so well that the classifier (either SVM or REFOCUS) is very confident in its labeling: each instance receives a high probability of being in its actual class, and changing the threshold barely changes the classification results. We also observe that our approach outperforms BotOrNot. On Social Spambots 2, SVM and BotOrNot outperform our approach in precision and have lower recall, similar to the Arabic Honeypot dataset, because they are not designed to optimize recall. The \(F_1\) of our approach is similar to that of the baselines.

6 Conclusion and Future Directions

The dominant trend among previously proposed bot detection methods is to focus solely on precision, ensuring that no human user is marked as a bot, or to optimize for \(F_1\). In this work, we showed that we can focus on the recall of a bot detection model without sacrificing overall performance. We tested our method on three real-world datasets and observed that using the \(F_2\) score in the training phase finds the classification threshold that optimizes recall while maintaining high overall performance in terms of \(F_1\). In the future, we wish to explore the robustness of our method on translated datasets and to measure its effectiveness in discriminating between different types of bots in a dataset.