1 Background

Social media activity data, in the case of this paper Twitter account activity, can be understood as consisting of two primary components: metadata (or demographics) and content data. Metadata covers external characteristics such as time of activity, time of account creation, location, type of platform used for the activity, number of friends, number of followers, and more. Content data covers syntactic and semantic characteristics. The focus of this paper is on content data, in particular on content feature extraction that can be applied to a large set of text data in order to enable categorization of types of activities and classification of activities as automated versus non-automated.

This paper is an attempt to build a model that makes predictions based only on content data. Among the various available modeling tools, the Linear Regression, K-Nearest Neighbor, and Neural Network models, abbreviated as the LR, KNN, and NN models respectively, were of the highest interest.

2 Method

2.1 Data

Twitter account activity data is available through the Twitter API (application program interface), which returns random samples of data in the JSON (JavaScript Object Notation) data structure containing both demographics and content.

Content data (tweets) are returned (within the JSON structure) as character strings of 1 to 140 characters. They may be in any language or no language at all. Tweets can contain any combination of free text, emoticons, chat-speak, hash tags, and URLs. Twitter does not filter tweets for content (e.g., vulgarisms, hate speech).
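For concreteness, the following minimal Python sketch shows how the metadata and content components of one returned status object might be separated; the field subset and the values shown here are illustrative only, not a complete API response.

```python
import json

# Illustrative sketch only: a small subset of the fields a Twitter API (v1.1)
# status object contains; real responses carry many more fields.
raw = '''{
  "created_at": "Mon Apr 03 18:12:09 +0000 2017",
  "text": "Check this out! #deal http://example.com \\ud83d\\ude00",
  "source": "<a href=\\"http://twitter.com\\">Twitter Web Client</a>",
  "user": {
    "created_at": "Tue Jan 05 21:10:45 +0000 2016",
    "location": "Unknown",
    "followers_count": 42,
    "friends_count": 17
  }
}'''

status = json.loads(raw)

# Metadata (demographics): external characteristics of the activity/account.
metadata = {
    "tweet_time": status["created_at"],
    "account_created": status["user"]["created_at"],
    "location": status["user"]["location"],
    "platform": status["source"],
    "followers": status["user"]["followers_count"],
    "friends": status["user"]["friends_count"],
}

# Content: the tweet text itself (1 to 140 characters), which is what this
# paper's feature extraction operates on.
content = status["text"]
```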

A vector of text features is derived for each user: text features are computed for each of the user’s tweets and then rolled up, so that one content feature vector is derived per user from all of that user’s tweets.

The extraction of numeric features from text is a multi-step process:

  1. Collect the User’s most recent (up to 200) tweet strings into a single set (a Thread).

  2. Convert the thread text to upper case for term matching.

  3. Scan the thread for the presence of emoticons, chat-speak, hash tags, URLs, and vulgarisms, setting bits to indicate the presence/absence of each of these text artifacts.

  4. Remove special characters from the thread to facilitate term matching.

  5. Create a Redundancy Score for the Thread (see the sketch after this list). This is done by computing and rolling up (sum and normalize) the pairwise similarities of the tweet strings within the thread using six metrics: Euclidean distance, RMS distance, L1 distance, L-infinity distance, cosine distance, and the norm-weighted average of the five distances.

The resulting thread text feature vector contains, as its components, user scores based on features such as the emoticon flag, chat-speak flag, hash tag flag, URL flag, vulgarity flag, and the Redundancy Score.
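The Redundancy Score (step 5) is the most involved of these features. The sketch below shows one way it could be computed in Python; the bag-of-words vectorization, the inversion of average distance into a score, and the reading of “norm-weighted average” are assumptions, since the paper does not fix these details.

```python
import itertools
import re
import numpy as np

def tweet_vectors(tweets):
    """Bag-of-words count vectors over the thread's vocabulary (illustrative)."""
    vocab = sorted({w for t in tweets for w in re.findall(r"[A-Z]+", t.upper())})
    index = {w: i for i, w in enumerate(vocab)}
    vecs = np.zeros((len(tweets), len(vocab)))
    for row, t in enumerate(tweets):
        for w in re.findall(r"[A-Z]+", t.upper()):
            vecs[row, index[w]] += 1
    return vecs

def pairwise_distances(u, v):
    """The five distances named in the text, plus their norm-weighted average."""
    d = u - v
    euclid = np.sqrt(np.sum(d ** 2))
    rms = np.sqrt(np.mean(d ** 2))
    l1 = np.sum(np.abs(d))
    linf = np.max(np.abs(d), initial=0.0)
    # Cosine distance: 1 - cosine similarity (guard against zero vectors).
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    cosine = 1.0 - (np.dot(u, v) / denom if denom > 0 else 0.0)
    five = np.array([euclid, rms, l1, linf, cosine])
    # Weighting each distance by its own magnitude is one reading of
    # "norm-weighted average"; the paper does not spell out the weights.
    weights = five / five.sum() if five.sum() > 0 else np.ones(5) / 5
    return np.append(five, np.dot(weights, five))

def redundancy_score(tweets):
    """Sum the pairwise metric vectors over all tweet pairs, then normalize."""
    vecs = tweet_vectors(tweets)
    if vecs.shape[1] == 0:
        return 0.0
    pairs = list(itertools.combinations(range(len(tweets)), 2))
    if not pairs:
        return 0.0
    total = sum(pairwise_distances(vecs[i], vecs[j]) for i, j in pairs)
    total = total / len(pairs)          # normalize by the number of pairs
    # Lower average distance means more repetitive (more redundant) tweets,
    # so invert the average distance into a similarity-style score.
    return 1.0 / (1.0 + total.mean())
```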

For this study, a sample of the activity of 8845 Twitter accounts containing 1,048,395 tweets was collected for content analysis.

A list of 23 potential content-related features was created, and each feature was calculated for each of the 8845 Twitter accounts in the sample. These features were used for the modeling in this paper (Table 1).

Table 1. The list of 23 features for analysis

2.2 Software

XLMiner is used to sort out the most important features for model building and model assessment/validation.

2.3 Procedures

For the purpose of predicting whether a tweet was automated or not, a manual rating of sample tweet content from 101 active accounts was carried out. Of the 101 accounts, 65 were jointly classified with a high level of confidence: 35 as bot accounts and 30 as non-bot accounts. Those 65 accounts were then assigned a dependent-variable value of 1 if identified as a bot, and 0 otherwise.

An analysis of the correlation of each of the 23 features with the dependent variable (bot or not) was carried out on this set of 65 accounts to identify the 10 most important predictive features (those with the highest correlation scores).
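A minimal sketch of this correlation screening, assuming the 65 tagged accounts are held in a pandas DataFrame with one column per feature plus a 0/1 bot label (the column names are placeholders, not the paper’s exact names):

```python
import numpy as np
import pandas as pd

def top_correlated_features(df: pd.DataFrame, label_col: str = "bot", k: int = 10):
    """Rank features by |Pearson correlation| with the 0/1 bot label."""
    features = [c for c in df.columns if c != label_col]
    corr = {f: abs(np.corrcoef(df[f], df[label_col])[0, 1]) for f in features}
    # Return the k feature names with the highest absolute correlation.
    return sorted(corr, key=corr.get, reverse=True)[:k]
```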

The BOT-NotBOT tags from the 65 manually tagged threads were extrapolated to the larger corpus of 8845 threads using a population-weighted N-nearest-neighbor classifier with the 65-thread set as the standard. N was allowed to vary from 1 to 20; the tagging for N = 5 was chosen for the extrapolation because it best matched the class proportions of the 65-thread standard.
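The extrapolation step could be sketched as follows; the distance weighting here stands in for the “population-weighted” scheme, whose exact form the paper does not specify:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def extrapolate_tags(X_tagged, y_tagged, X_all, n_values=range(1, 21)):
    """Spread the 65 manual BOT/NotBOT tags to the full corpus.

    For each candidate N, fit an N-nearest-neighbor classifier on the tagged
    threads, tag the full corpus, and keep the N whose predicted bot
    proportion is closest to the proportion in the tagged standard.
    """
    y_tagged = np.asarray(y_tagged)
    target_rate = y_tagged.mean()           # bot proportion in the 65-thread standard
    best_n, best_tags, best_gap = None, None, np.inf
    for n in n_values:
        knn = KNeighborsClassifier(n_neighbors=n, weights="distance")
        knn.fit(X_tagged, y_tagged)
        tags = knn.predict(X_all)
        gap = abs(tags.mean() - target_rate)
        if gap < best_gap:
            best_n, best_tags, best_gap = n, tags, gap
    return best_n, best_tags
```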

XLMiner’s Feature Selection tool was then used on the extrapolated dataset to identify the best subset of features to serve as input to a classification or prediction method. After the 10 most important features were selected, they were considered in conjunction with the set of important features obtained from the correlation analysis of the 65 manually tagged accounts. A preliminary analysis using the LR Model was then performed to determine the best subset of features for modeling by trial and error; this process was guided by, but not limited to, the union of the two sets of best features obtained from the two preliminary analyses.

The resulting subset of best features was then used to build the LR, KNN, and NN models. The set of 8845 accounts was split into two portions: 60% became the training set and the remaining 40% became the validation set. The training set was used to build each model, and the validation set was used to evaluate each model’s accuracy. Each model predicted whether an account in the validation set was a bot or not, and the prediction was compared with the BOT-NotBOT tag, either assigned manually by raters or determined by the extrapolation described above.
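As a hedged illustration of this modeling step (not the XLMiner workflow itself), the following scikit-learn sketch reproduces the 60/40 split and the three model types on placeholder data; the specific model settings are assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Placeholder data standing in for the real 8845-by-9 feature matrix and the
# extrapolated BOT/NotBOT tags; in practice these come from the steps above.
rng = np.random.default_rng(0)
X = rng.random((8845, 9))
y = (rng.random(8845) < 0.3).astype(int)

# 60/40 training/validation split as described in the text.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.6, random_state=0)

# LR model: regression scores thresholded at the 0.5 cut-off discussed later.
lr = LinearRegression().fit(X_train, y_train)
lr_pred = (lr.predict(X_valid) >= 0.5).astype(int)

# KNN model; the neighbor count is illustrative, not XLMiner's setting.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
knn_pred = knn.predict(X_valid)

# NN model: a small feed-forward network as a stand-in for XLMiner's NN.
nn = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=0)
nn_pred = nn.fit(X_train, y_train).predict(X_valid)

for name, pred in [("LR", lr_pred), ("KNN", knn_pred), ("NN", nn_pred)]:
    print(f"{name} overall % error: {100 * np.mean(pred != y_valid):.2f}")
```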

Cumulative gains charts for all models were plotted to evaluate the predictive power of each model.
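Continuing the sketch above, a cumulative gains curve can be computed by sorting the validation cases by predicted score and accumulating the fraction of bots captured:

```python
import numpy as np
import matplotlib.pyplot as plt

def cumulative_gains(y_true, scores):
    """Fraction of all bots captured as cases are taken in descending score order."""
    order = np.argsort(-np.asarray(scores))
    y_sorted = np.asarray(y_true)[order]
    captured = np.cumsum(y_sorted) / y_sorted.sum()
    fraction = np.arange(1, len(y_sorted) + 1) / len(y_sorted)
    return fraction, captured

# Example: the LR model's curve against the random-guess baseline.
frac, gain = cumulative_gains(y_valid, lr.predict(X_valid))
plt.plot(frac, gain, label="LR model")
plt.plot([0, 1], [0, 1], "--", label="baseline (no model)")
plt.xlabel("Fraction of validation cases")
plt.ylabel("Fraction of bots captured")
plt.legend()
plt.show()
```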

3 Result

3.1 Feature Selection

The following predictors were returned as the 10 most important features, sorted in descending order of importance (Table 2):

Table 2. The list of most important features (common features highlighted)

The feature “good_cnt” was dropped because it correlates highly with “good_len”: more correctly spelled words imply more characters of correctly spelled words. The grammatical features “art”, “punc”, “adj”, and “prep” were considered to be of little importance and were therefore dropped. The feature “hash”, which did not appear in the top 10 features from either analysis, was included because it was deemed important on inspection of the tweet data. The following 9 features were singled out as the best features for model building:

  1. tweets

  2. redund

  3. commnoun

  4. propnoun

  5. vulgar

  6. hash

  7. urls

  8. emo_chat

  9. good_len

3.2 Modeling

After training the models with the training set, the results given by the validation set were as follows (Tables 3, 4 and 5; Figs. 1, 2 and 3):

Table 3. a, b The results of LR model
Table 4. a, b The results of KNN model
Table 5. a, b The results of NN model
Fig. 1. The cumulative gains chart of the LR model

Fig. 2. The cumulative gains chart of the KNN model

Fig. 3. The cumulative gains chart of the NN model

4 Discussion

4.1 Findings

We can see that the overall percentage error for the NN model is the lowest among the models built. We therefore conclude that the NN model is best for classifying a new tweet account as BOT or NotBOT, as it has the lowest classification error rate (Table 6).

Table 6. The summary of results of all models

As observed from the results, there is more error associated with the BOT class. If we want to correctly classify more BOTs, we may need to lower the cut-off value from 0.5, which is the default in XLMiner.
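Continuing the earlier sketch, the effect of lowering the cut-off could be examined as follows; the candidate cut-off values are illustrative:

```python
# Lowering the cut-off from XLMiner's default of 0.5 trades false positives
# for fewer missed bots; recompute the confusion counts at each cut-off.
for cutoff in (0.5, 0.4, 0.3):
    pred = (lr.predict(X_valid) >= cutoff).astype(int)
    missed_bots = np.sum((pred == 0) & (y_valid == 1))
    false_alarms = np.sum((pred == 1) & (y_valid == 0))
    print(f"cutoff={cutoff}: missed bots={missed_bots}, false alarms={false_alarms}")
```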

The curvature of the cumulative gains charts shows that the predictive power of every model was better than a random guess without a model, as all curves lie above the baseline. This confirms that all models had significant predictive power in determining whether a tweet account is automated or not.

4.2 Limitations

A number of significant limitations must be noted.

First, the data set may not be a representative sample of the current state of affairs when it comes to bot versus non-bot activity in the Twitter medium.

Second, the process of manually classifying a small set of accounts and reaching a consensus in roughly two-thirds of the cases may not be without errors.

Third, a larger manually classified set may lead to different conclusions about content features and support other types of modeling, such as penalized Logistic Regression and other techniques. These methods, had they been supported by the data size, might have yielded more precise classification.

Fourth, concentrating on content, which probably provides the most predictive power, may still ignore some critical external features, and thus may not produce an optimal perspective.

4.3 Further Investigations

Future work may attempt to consider a mix of external and content features, calculated based on the activities in a large set of confirmed bot and non-bot accounts. This should enable a much more reliable subset of predictive or discriminating features, which in turn may lead to more reliable descriptive and predictive models.

5 Conclusion

This paper demonstrates one way by which content of social media activities may be processed in terms of mathematical “signatures” of different types of online behaviors that may be used for descriptive and predictive modeling of automated versus non-automated activities.