Twitter spam account detection based on clustering and classification methods

Adewole, Kayode Sakariyah; Han, Tao; Wu, Wanqing; Song, Houbing; Sangaiah, Arun Kumar

doi:10.1007/s11227-018-2641-x

Published: 10 October 2018

Volume 76, pages 4802–4837, (2020)

The Journal of Supercomputing

Abstract

Twitter has grown in popularity as the social activity of its registered users has increased. It serves a dual function as an online social network (OSN): it is both a microblogging service and a news update platform. This growth in social interaction has recently attracted the attention of cybercriminals. Spammers use Twitter to spread malicious messages, post phishing links, flood the network with fake accounts, and engage in other malicious activities. Detecting the network of spammers who engage in these activities is an important step toward identifying individual spam accounts. Researchers have proposed a number of approaches to identify groups of spammers; however, each of these approaches addresses a specific category of spammer. This paper proposes a different approach that detects spammers on Twitter based on the similarities that exist among spam accounts. A number of features are introduced to improve the performance of the three classification algorithms selected in this study. The proposed approach applies principal component analysis and a tuned K-means algorithm to cluster over 200,000 accounts, randomly selected from more than 2 million tweets, in order to detect clusters of spammers. Experimental results show that Random Forest achieved the highest accuracy of 96.30%, followed by multilayer perceptron with 96.00% and support vector machine with 95.60%. An evaluation of the selected classifiers under class imbalance also showed that Random Forest achieved the highest accuracy, precision, recall, and F-measure.
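The abstract reports accuracy together with precision, recall, and F-measure under class imbalance. The following minimal Python sketch illustrates why the additional measures matter when spam accounts are a minority. It is not the authors' evaluation code: the `evaluate` helper, the `Metrics` container, and the ten-account sample are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    accuracy: float
    precision: float
    recall: float
    f_measure: float


def evaluate(y_true, y_pred, positive=1):
    """Compute the four measures, treating `positive` (spam) as the target class."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Fall back to 0.0 when a ratio is undefined (no predicted or no actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return Metrics(accuracy, precision, recall, f_measure)


# Hypothetical imbalanced sample: 2 spam accounts (label 1) among 10 accounts.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_legit = [0] * 10                     # a classifier that never flags spam
detector = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]   # one hit, one miss, one false alarm

baseline = evaluate(y_true, always_legit)
result = evaluate(y_true, detector)
print(baseline)
print(result)
```

In this made-up sample both classifiers reach the same accuracy, yet only the second one finds any spam account, which is visible only in precision, recall, and F-measure.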

Acknowledgements

This work is funded by the Nigerian Tertiary Education Trust Fund (TETFund).

Author information

Authors and Affiliations

Faculty of Communication and Information Sciences, University of Ilorin, Ilorin, Nigeria
Kayode Sakariyah Adewole
DGUT-CNAM Institute, Dongguan University of Technology, Dongguan, Guangdong Province, People’s Republic of China
Tao Han
CAS Key Laboratory of Human-Machine Intelligence-Synergy Systems, Shenzhen Institutes of Advanced Technology (SIAT), Shenzhen, 518055, China
Wanqing Wu
Institute of Biomedical and Health Engineering, SIAT, Chinese Academy of Sciences (CAS), Shenzhen, 518055, China
Wanqing Wu
Department of Electrical, Computer, Software, and Systems Engineering, Embry-Riddle Aeronautical University, Daytona Beach, FL, 32114, USA
Houbing Song
School of Computing Science and Engineering, Vellore Institute of Technology, Vellore, 632014, India
Arun Kumar Sangaiah

Corresponding author

Correspondence to Arun Kumar Sangaiah.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest.

Ethical approval

This research does not involve human or animal subjects; however, it requires data collection from the Twitter social network, which is subject to a privacy policy. The research fully complied with Twitter's privacy policy by following the Twitter-approved procedures for data collection using OAuth authentication. In addition, we do not release the data collected from Twitter to any researcher. The identities of the individual accounts in the data are anonymized.

Informed consent

Informed consent was obtained from all individual participants included in the study.

About this article

Cite this article

Adewole, K.S., Han, T., Wu, W. et al. Twitter spam account detection based on clustering and classification methods. J Supercomput 76, 4802–4837 (2020). https://doi.org/10.1007/s11227-018-2641-x

Issue Date: July 2020
DOI: https://doi.org/10.1007/s11227-018-2641-x
