1 Background

Social media activity data, in the case of this paper Twitter account activity, can be understood as consisting of two primary components: metadata (or demographics) and content data. Metadata covers external characteristics such as time of activity, time of account creation, location, type of platform used for the activity, number of friends, number of followers, and more. Content data covers syntactic and semantic characteristics. The focus of this paper is on content data, in particular on content feature extraction that can be applied to a large set of text data in order to enable categorization of types of activities and classification of activities as automated versus non-automated.

This paper is an attempt to build a model that makes predictions based only on content data. Among the various available modeling tools, the Linear Regression, K-Nearest Neighbor, and Neural Network models, abbreviated as the LR, KNN, and NN models respectively, were of the highest interest.

2 Method

2.1 Data

Twitter account activity data is available through the Twitter API (application program interface), which returns random samples of data in the JSON (JavaScript Object Notation) data structure containing both demographics and content.

Content data (tweets) are returned (within the JSON structure) as character strings of 1 to 140 characters. They may be in any language or no language at all. Tweets can contain any combination of free text, emoticons, chat-speak, hash tags, and URLs. Twitter does not filter tweets for content (e.g., vulgarisms, hate speech).
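For concreteness, the following minimal Python sketch shows how the metadata and content components of one returned status object might be separated; the field subset and the values shown here are illustrative only, not a complete API response.

```python
import json

# Illustrative sketch only: a small subset of the fields a Twitter API (v1.1)
# status object contains; real responses carry many more fields.
raw = '''{
  "created_at": "Mon Apr 03 18:12:09 +0000 2017",
  "text": "Check this out! #deal http://example.com \\ud83d\\ude00",
  "source": "<a href=\\"http://twitter.com\\">Twitter Web Client</a>",
  "user": {
    "created_at": "Tue Jan 05 21:10:45 +0000 2016",
    "location": "Unknown",
    "followers_count": 42,
    "friends_count": 17
  }
}'''

status = json.loads(raw)

# Metadata (demographics): external characteristics of the activity/account.
metadata = {
    "tweet_time": status["created_at"],
    "account_created": status["user"]["created_at"],
    "location": status["user"]["location"],
    "platform": status["source"],
    "followers": status["user"]["followers_count"],
    "friends": status["user"]["friends_count"],
}

# Content: the tweet text itself (1 to 140 characters), which is what this
# paper's feature extraction operates on.
content = status["text"]
```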

A vector of text features is derived for each user: text features are computed for each of the user’s tweets and then rolled up, so that one content feature vector is derived per user from all of that user’s tweets.

The extraction of numeric features from text is a multi-step process:

  1. Collect the User’s most recent (up to 200) tweet strings into a single set (a Thread).

  2. Convert the thread text to upper case for term matching.

  3. Scan the thread for the presence of emoticons, chat-speak, hash tags, URLs, and vulgarisms, setting bits to indicate the presence/absence of each of these text artifacts.

  4. Remove special characters from the thread to facilitate term matching.

  5. Create a Redundancy Score for the Thread (see the sketch after this list). This is done by computing and rolling up (sum and normalize) the pairwise similarities of the tweet strings within the thread using six metrics: Euclidean distance, RMS distance, L1 distance, L-infinity distance, cosine distance, and the norm-weighted average of the five distances.

The resulting thread text feature vector contains, as its components, user scores based on features such as the emoticon flag, chat-speak flag, hash tag flag, URL flag, vulgarity flag, and the Redundancy Score.
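The Redundancy Score (step 5) is the most involved of these features. The sketch below shows one way it could be computed in Python; the bag-of-words vectorization, the inversion of average distance into a score, and the reading of “norm-weighted average” are assumptions, since the paper does not fix these details.

```python
import itertools
import re
import numpy as np

def tweet_vectors(tweets):
    """Bag-of-words count vectors over the thread's vocabulary (illustrative)."""
    vocab = sorted({w for t in tweets for w in re.findall(r"[A-Z]+", t.upper())})
    index = {w: i for i, w in enumerate(vocab)}
    vecs = np.zeros((len(tweets), len(vocab)))
    for row, t in enumerate(tweets):
        for w in re.findall(r"[A-Z]+", t.upper()):
            vecs[row, index[w]] += 1
    return vecs

def pairwise_distances(u, v):
    """The five distances named in the text, plus their norm-weighted average."""
    d = u - v
    euclid = np.sqrt(np.sum(d ** 2))
    rms = np.sqrt(np.mean(d ** 2))
    l1 = np.sum(np.abs(d))
    linf = np.max(np.abs(d), initial=0.0)
    # Cosine distance: 1 - cosine similarity (guard against zero vectors).
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    cosine = 1.0 - (np.dot(u, v) / denom if denom > 0 else 0.0)
    five = np.array([euclid, rms, l1, linf, cosine])
    # Weighting each distance by its own magnitude is one reading of
    # "norm-weighted average"; the paper does not spell out the weights.
    weights = five / five.sum() if five.sum() > 0 else np.ones(5) / 5
    return np.append(five, np.dot(weights, five))

def redundancy_score(tweets):
    """Sum the pairwise metric vectors over all tweet pairs, then normalize."""
    vecs = tweet_vectors(tweets)
    if vecs.shape[1] == 0:
        return 0.0
    pairs = list(itertools.combinations(range(len(tweets)), 2))
    if not pairs:
        return 0.0
    total = sum(pairwise_distances(vecs[i], vecs[j]) for i, j in pairs)
    total = total / len(pairs)          # normalize by the number of pairs
    # Lower average distance means more repetitive (more redundant) tweets,
    # so invert the average distance into a similarity-style score.
    return 1.0 / (1.0 + total.mean())
```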

For this study, a sample of the activity of 8845 Twitter accounts containing 1,048,395 tweets was collected for content analysis.

A list of 23 potential content-related features was created, and each feature was calculated for each of the 8845 Twitter accounts in the sample. These features were used for the modeling in this paper (Table 1).

Table 1. The list of 23 features for analysis

2.2 Software

XLMiner is used to sort out the most important features for model building and model assessment/validation.

2.3 Procedures

For the purpose of predicting whether a tweet was automated or not, a manual rating of sample tweet content from 101 active accounts was carried out. Of the 101 accounts, 65 were jointly classified with a high level of confidence: 35 as bot accounts and 30 as non-bot accounts. Those 65 accounts were then assigned a dependent-variable value of 1 if identified as a bot, and 0 otherwise.

An analysis of the correlation of each of the 23 features with the dependent variable (bot or not) was carried out on this set of 65 accounts to identify the 10 most important predictive features (those with the highest correlation scores).
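A minimal sketch of this correlation screening, assuming the 65 tagged accounts are held in a pandas DataFrame with one column per feature plus a 0/1 bot label (the column names are placeholders, not the paper’s exact names):

```python
import numpy as np
import pandas as pd

def top_correlated_features(df: pd.DataFrame, label_col: str = "bot", k: int = 10):
    """Rank features by |Pearson correlation| with the 0/1 bot label."""
    features = [c for c in df.columns if c != label_col]
    corr = {f: abs(np.corrcoef(df[f], df[label_col])[0, 1]) for f in features}
    # Return the k feature names with the highest absolute correlation.
    return sorted(corr, key=corr.get, reverse=True)[:k]
```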

The BOT-NotBOT tags from the 65 manually tagged threads were extrapolated to the larger corpus of 8845 threads using a population-weighted N-nearest-neighbor classifier with the 65-thread set as the standard. N was allowed to vary from 1 to 20; the tagging for N = 5 was chosen for the extrapolation because it best matched the class proportions of the 65-thread standard.
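The extrapolation step could be sketched as follows; the distance weighting here stands in for the “population-weighted” scheme, whose exact form the paper does not specify:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def extrapolate_tags(X_tagged, y_tagged, X_all, n_values=range(1, 21)):
    """Spread the 65 manual BOT/NotBOT tags to the full corpus.

    For each candidate N, fit an N-nearest-neighbor classifier on the tagged
    threads, tag the full corpus, and keep the N whose predicted bot
    proportion is closest to the proportion in the tagged standard.
    """
    y_tagged = np.asarray(y_tagged)
    target_rate = y_tagged.mean()           # bot proportion in the 65-thread standard
    best_n, best_tags, best_gap = None, None, np.inf
    for n in n_values:
        knn = KNeighborsClassifier(n_neighbors=n, weights="distance")
        knn.fit(X_tagged, y_tagged)
        tags = knn.predict(X_all)
        gap = abs(tags.mean() - target_rate)
        if gap < best_gap:
            best_n, best_tags, best_gap = n, tags, gap
    return best_n, best_tags
```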

XLMiner’s Feature Selection tool was then used on the extrapolated dataset to identify the best subset of features to serve as input to a classification or prediction method. After the 10 most important features were selected, they were considered in conjunction with the set of important features obtained from the correlation analysis of the 65 manually tagged accounts. A preliminary analysis using the LR Model was then performed to determine the best subset of features for modeling by trial and error; this process was guided by, but not limited to, the union of the two sets of best features obtained from the two preliminary analyses.

The resulting subset of best features was then used to build the LR, KNN, and NN models. The set of 8845 accounts was split into two portions: 60% became the training set and the remaining 40% became the validation set. The training set was used to build each model, and the validation set was used to evaluate each model’s accuracy. Each model predicted whether an account in the validation set was a bot or not, and the prediction was compared with the BOT-NotBOT tag, either assigned manually by raters or determined by the extrapolation described above.
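As a hedged illustration of this modeling step (not the XLMiner workflow itself), the following scikit-learn sketch reproduces the 60/40 split and the three model types on placeholder data; the specific model settings are assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Placeholder data standing in for the real 8845-by-9 feature matrix and the
# extrapolated BOT/NotBOT tags; in practice these come from the steps above.
rng = np.random.default_rng(0)
X = rng.random((8845, 9))
y = (rng.random(8845) < 0.3).astype(int)

# 60/40 training/validation split as described in the text.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.6, random_state=0)

# LR model: regression scores thresholded at the 0.5 cut-off discussed later.
lr = LinearRegression().fit(X_train, y_train)
lr_pred = (lr.predict(X_valid) >= 0.5).astype(int)

# KNN model; the neighbor count is illustrative, not XLMiner's setting.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
knn_pred = knn.predict(X_valid)

# NN model: a small feed-forward network as a stand-in for XLMiner's NN.
nn = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=0)
nn_pred = nn.fit(X_train, y_train).predict(X_valid)

for name, pred in [("LR", lr_pred), ("KNN", knn_pred), ("NN", nn_pred)]:
    print(f"{name} overall % error: {100 * np.mean(pred != y_valid):.2f}")
```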

Cumulative gains charts for all models were plotted to evaluate the predictive power of each model.
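Continuing the sketch above, a cumulative gains curve can be computed by sorting the validation cases by predicted score and accumulating the fraction of bots captured:

```python
import numpy as np
import matplotlib.pyplot as plt

def cumulative_gains(y_true, scores):
    """Fraction of all bots captured as cases are taken in descending score order."""
    order = np.argsort(-np.asarray(scores))
    y_sorted = np.asarray(y_true)[order]
    captured = np.cumsum(y_sorted) / y_sorted.sum()
    fraction = np.arange(1, len(y_sorted) + 1) / len(y_sorted)
    return fraction, captured

# Example: the LR model's curve against the random-guess baseline.
frac, gain = cumulative_gains(y_valid, lr.predict(X_valid))
plt.plot(frac, gain, label="LR model")
plt.plot([0, 1], [0, 1], "--", label="baseline (no model)")
plt.xlabel("Fraction of validation cases")
plt.ylabel("Fraction of bots captured")
plt.legend()
plt.show()
```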

3 Result

3.1 Feature Selection

The following predictors were returned as the 10 most important features, sorted in descending order of importance (Table 2):

Table 2. The list of most important features (common features highlighted)

The feature “good_cnt” was dropped because it correlates highly with “good_len”: more correctly spelled words imply more characters of correctly spelled words. The grammatical features “art”, “punc”, “adj”, and “prep” were considered to be of little importance and were therefore dropped. The feature “hash”, which did not appear in the top 10 features from either analysis, was included because it was deemed important on inspection of the tweet data. The following 9 features were singled out as the best features for model building:

  1. tweets

  2. redund

  3. commnoun

  4. propnoun

  5. vulgar

  6. hash

  7. urls

  8. emo_chat

  9. good_len

3.2 Modeling

After training the models with the training set, the results given by the validation set were as follows (Tables 3, 4 and 5; Figs. 1, 2 and 3):

Table 3. a, b The results of LR model
Table 4. a, b The results of KNN model
Table 5. a, b The results of NN model
Fig. 1. The cumulative gains chart of the LR model

Fig. 2. The cumulative gains chart of the KNN model

Fig. 3. The cumulative gains chart of the NN model

4 Discussion

4.1 Findings

We can see that the overall percentage error for the NN model is the lowest among the models built. We therefore conclude that the NN model is best for classifying a new tweet account as BOT or NotBOT, as it has the lowest classification error rate (Table 6).

Table 6. The summary of results of all models

As observed from the results, there is more error associated with the BOT class. If we want to correctly classify more BOTs, we may need to lower the cut-off value from 0.5, which is the default in XLMiner.
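Continuing the earlier sketch, the effect of lowering the cut-off could be examined as follows; the candidate cut-off values are illustrative:

```python
# Lowering the cut-off from XLMiner's default of 0.5 trades false positives
# for fewer missed bots; recompute the confusion counts at each cut-off.
for cutoff in (0.5, 0.4, 0.3):
    pred = (lr.predict(X_valid) >= cutoff).astype(int)
    missed_bots = np.sum((pred == 0) & (y_valid == 1))
    false_alarms = np.sum((pred == 1) & (y_valid == 0))
    print(f"cutoff={cutoff}: missed bots={missed_bots}, false alarms={false_alarms}")
```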

The curvature of the cumulative gains charts shows that the predictive power of every model was better than a random guess without a model, as all curves lie above the baseline. This confirms that all models had significant predictive power in determining whether a tweet account is automated or not.

4.2 Limitations

A number of significant limitations must be noted.

First, the data set may not be a representative sample of the current state of affairs when it comes to bot versus non-bot activity in the Twitter medium.

Second, the process of manually classifying a small set of accounts and reaching a consensus in roughly two-thirds of the cases may not be without errors.

Third, a larger manually classified set may lead to different conclusions about content features and support other types of modeling, such as penalized Logistic Regression and other techniques. These methods, had they been supported by the data size, might have yielded more precise classification.

Fourth, concentrating on content, which probably provides the most predictive power, may still ignore some critical external features, and thus may not produce an optimal perspective.

4.3 Further Investigations

Future work may attempt to consider a mix of external and content features, calculated based on the activities in a large set of confirmed bot and non-bot accounts. This should enable a much more reliable subset of predictive or discriminating features, which in turn may lead to more reliable descriptive and predictive models.

5 Conclusion

This paper demonstrates one way by which content of social media activities may be processed in terms of mathematical “signatures” of different types of online behaviors that may be used for descriptive and predictive modeling of automated versus non-automated activities.