Skip to main content

Advertisement

Log in

Twitter spam account detection based on clustering and classification methods

  • Published:
The Journal of Supercomputing Aims and scope Submit manuscript

Abstract

Twitter social network has gained more popularity due to the increase in social activities of registered users. Twitter performs dual functions of online social network (OSN), acting as a microblogging OSN, and at the same time as a news update platform. Recently, the growth in Twitter social interactions has attracted the attention of cybercriminals. Spammers have used Twitter to spread malicious messages, post phishing links, flood the network with fake accounts, and engage in other malicious activities. The process of detecting the network of spammers who engage in these activities is an important step toward identifying individual spam account. Researchers have proposed a number of approaches to identify a group of spammers. However, each of these approaches addressed a specific category of spammer. This paper proposes a different approach to detect spammers on Twitter based on the similarities that exist among spam accounts. A number of features were introduced to improve the performance of the three classification algorithms selected in this study. The proposed approach applied principal component analysis and tuned K-means algorithm to cluster over 200,000 accounts, randomly selected from more than 2 million tweets to detect the clusters of spammers. Experimental results show that Random Forest achieved the highest accuracy of 96.30%. This result is followed by multilayer perceptron with 96.00% and support vector machine, which achieved 95.60%. The performance of the selected classifiers based on class imbalance also revealed that Random Forest achieved the highest accuracy, precision, recall, and F-measure.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15

Similar content being viewed by others

References

  1. Adewole KS, Anuar NB, Kamsin A, Varathan KD, Razak SA (2016) Malicious accounts: dark of the social networks. J Netw Comput Appl. https://doi.org/10.1016/j.jnca.2016.11.030

    Article  Google Scholar 

  2. Adikari S, Dutta K (2014) Identifying fake profiles in LinkedIn. In: PACIS

  3. Aggarwal A, Rajadesingan A, Kumaraguru P (2012) PhishAri: Automatic realtime phishing detection on twitter. In: eCrime Researchers Summit (eCrime)

  4. Ahmed F, Abulaish M (2012) An MCL-based approach for spam profile detection in online social networks. In: 2012 IEEE 11th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom)

  5. Ahmed F, Abulaish M (2013) A generic statistical approach for spam detection in Online Social Networks. Comput Commun 36(10–11):1120–1129. https://doi.org/10.1016/j.comcom.2013.04.004

    Article  Google Scholar 

  6. Aiyar S, Shetty NP (2018) N-gram assisted Youtube spam comment detection. Procedia Comput Sci 132:174–182

    Article  Google Scholar 

  7. Al-Qurishi M, Al-Rakhami M, Alamri A, Alrubaian M, Rahman SMM, Hossain MS (2017) Sybil defense techniques in online social networks: a survey. IEEE Access 5:1200–1219

    Article  Google Scholar 

  8. Almaatouq A, Shmueli E, Nouh M, Alabdulkareem A, Singh VK, Alsaleh M, Alfaris A (2016) If it looks like a spammer and behaves like a spammer, it must be a spammer: analysis and detection of microblogging spam accounts. Int J Inf Secur 15:475–491

    Article  Google Scholar 

  9. Alsaleh M, Alarifi A, Al-Salman AM, Alfayez M, Almuhaysin A (2014) TSD: detecting sybil accounts in Twitter. In: 2014 13th IEEE International Conference on Machine Learning and Applications (ICMLA)

  10. Atluri AC, Tran V (2017) Botnets threat analysis and detection. In: Traoré I, Awad A, Woungang I (eds) Information security practices. Springer, Cham

    Google Scholar 

  11. Avci E, Turkoglu I (2009) An intelligent diagnosis system based on principle component analysis and ANFIS for the heart valve diseases. Expert Syst Appl 36(2):2873–2878

    Article  Google Scholar 

  12. Benevenuto F, Magno G, Rodrigues T, Almeida V (2010) Detecting spammers on Twitter. In: 7th Annual Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference, CEAS 2010

  13. Bhat SY, Abulaish M (2013) Community-based features for identifying spammers in online social networks. In: Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining

  14. Bhat SY, Abulaish M, Mirza AA (2014) Spammer classification using ensemble methods over structural social network features. In: Proceedings of the 2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), vol 02

  15. Chan PPK, Yang C, Yeung DS, Ng WWY (2014) Spam filtering for short messages in adversarial environment. Neurocomputing 155:167–176. https://doi.org/10.1016/j.neucom.2014.12.034

    Article  Google Scholar 

  16. Chen C-M, Guan D, Su Q-K (2014) Feature set identification for detecting suspicious URLs using Bayesian classification in social networks. Inf Sci 289:133–147

    Article  Google Scholar 

  17. Chu Z, Gianvecchio S, Wang H, Jajodia S (2012) Detecting automation of twitter accounts: Are you a human, bot, or cyborg? IEEE Trans Dependable Secure Comput 9(6):811–824. https://doi.org/10.1109/TDSC.2012.75

    Article  Google Scholar 

  18. Chu Z, Wang H, Widjaja I (2012) Detecting social spam campaigns on Twitter. In: Bao F, Samarati P, Zhou J (eds) Applied cryptography and network security. Lecture notes in computer science, vol 7341. Springer, Berlin

    Google Scholar 

  19. Cresci S, Di Pietro R, Petrocchi M, Spognardi A, Tesconi M (2017) The paradigm-shift of social spambots: evidence, theories, and tools for the arms race. arXiv preprint arXiv:1701.03017

  20. DMR (2014) Statistics of social networking sites. http://expandedramblings.com/index.php/resource-how-many-people-use-the-top-social-media

  21. Do-Jong K, Yong-Woon P, Dong-Jo P (2001) A novel validity index for determination of the optimal number of clusters. IEICE Trans Inf Syst 84(2):281–285

    Google Scholar 

  22. Echeverría J, Zhou S (2017) TheStar Wars’ botnet with > 350 k Twitter bots. arXiv preprint arXiv:1701.02405

  23. Egele M, Stringhini G, Kruegel C, Vigna G (2015) Towards detecting compromised accounts on social networks. IEEE Tran Dependable Secure Comput. https://doi.org/10.1109/TDSC.2015.2479616

    Article  Google Scholar 

  24. Gani K, Hacid H, Skraba R (2012) Towards multiple identity detection in social networks. In: Proceedings of the 21st International Conference Companion on World Wide Web. ACM

  25. Gao H, Hu J, Wilson C, Li Z, Chen Y, Zhao BY (2010) Detecting and characterizing social spam campaigns. In: Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement

  26. Gao S, Ma X, Wang L, Yu Y (2016) Spammer detection based on comprehensive features in Sina Microblog. In: 2016 13th International Conference on Service Systems and Service Management (ICSSSM)

  27. Ghosh S, Viswanath B, Kooti F, Sharma NK, Korlam G, Benevenuto F, Ganguly N, Gummadi KP (2012) Understanding and combating link farming in the twitter social network. In: Proceedings of the 21st International Conference World Wide Web, p 61

  28. Google (2015) Google safe browsing API. Retrieved from 25 Nov 2015, http://code.google.com/apis/safebrowsing/

  29. Grier C, Thomas K, Paxson V, Zhang M (2010) @spam: the underground on 140 characters or less. In: Proceedings of the 17th ACM Conference on Computer and Communications Security, pp 27–37

  30. Iqbal F, Binsalleeh H, Fung BC, Debbabi M (2010) Mining writeprints from anonymous e-mails for forensic investigation. Digit Investig 7(1):56–64

    Article  Google Scholar 

  31. Kiliroor CC, Valliyammai C (2019) Social context based Naive Bayes filtering of spam messages from online social networks. In: Nayak J, Abraham A, Krishna B, Chandra SG, Das A (eds) Soft computing in data analytics. Springer, Singapore, pp 699–706

    Chapter  Google Scholar 

  32. Kim K-J, Ahn H (2008) A recommender system using GA K-means clustering in an online shopping market. Expert Syst Appl 34(2):1200–1209

    Article  Google Scholar 

  33. Lee S, Kim J (2014) Early filtering of ephemeral malicious accounts on Twitter. Comput Commun 54:48–57

    Article  Google Scholar 

  34. Lin P-C, Huang P-M (2013) A study of effective features for detecting long-surviving Twitter spam accounts. In: 2013 15th International Conference on Advanced Communications Technology (ICACT), p 841

  35. Luckner M, Gad M, Sobkowiak P (2014) Stable web spam detection using features based on lexical items. Comput Secur 46:79–93. https://doi.org/10.1016/j.cose.2014.07.006

    Article  Google Scholar 

  36. Martinez-Romo J, Araujo L (2013) Detecting malicious tweets in trending topics using a statistical analysis of language. Expert Syst Appl 40:2992–3000. https://doi.org/10.1016/j.eswa.2012.12.015

    Article  Google Scholar 

  37. Mccord M, Chuah M (2011) Spam detection on twitter using traditional classifiers. In: Calero JMA, Yang LT, Mármol FG, García Villalba LJ, Li AX, Wang Y (eds) Autonomic and trusted computing. Springer, Berlin, pp 175–186

    Chapter  Google Scholar 

  38. Meligy AM, Ibrahim HM, Torky MF (2017) Identity verification mechanism for detecting fake profiles in online social networks. Int J Comput Netw Inf Secur 9(1):31

    Google Scholar 

  39. Muhammad K, Ahmad J, Rho S, Baik SW (2017) Image steganography for authenticity of visual contents in social networks. Multimed Tools Appl 76(18):18985–19004

    Article  Google Scholar 

  40. Muhammad K, Sajjad M, Mehmood I, Rho S, Baik SW (2016) Image steganography using uncorrelated color space and its application for security of visual contents in online social networks. Future Gener Comput Syst 86:951–960

    Article  Google Scholar 

  41. Narudin FA, Feizollah A, Anuar NB, Gani A (2016) Evaluation of machine learning classifiers for mobile malware detection. Soft Computing 20(1):343–357

    Article  Google Scholar 

  42. Noriega L (2005) Multilayer perceptron tutorial. School of Computing, Staffordshire University, Staffordshire

    Google Scholar 

  43. Nowakowska E, Koronacki J, Lipovetsky S (2016) Dimensionality reduction for data of unknown cluster structure. Inf Sci 330:74–87

    Article  Google Scholar 

  44. PhishTank (2015) Phishtank API. Retrieved from 25 Nov 2015, http://www.phishtank.com/

  45. Principal Components Analysis (2009) Principal components: Mathematics, example, interpretation. http://www.stat.cmu.edu/~cshalizi/350/lectures/10/lecture-10.pdf

  46. Quadri SA (2012) Feature extraction and selection methods & introduction to principal component analysis: a tutorial. http://www.slideshare.net/reachquadri/feature-extraction-and-principal-component-analysis

  47. Rokach L, Maimon O (2005) Clustering methods. In: Maimon O, Rokach L (eds) Data mining and knowledge discovery handbook. Springer, Boston, pp 321–352

    Chapter  Google Scholar 

  48. Sadan Z, Schwartz DG (2011) Social network analysis of web links to eliminate false positives in collaborative anti-spam systems. J Netw Comput Appl 34(5):1717–1723

    Article  Google Scholar 

  49. Shlens J (2014) A tutorial on principal component analysis. arXiv preprint arXiv:1404.1100

  50. Singh M, Bansal D, Sofat S (2014) Detecting malicious users in Twitter using classifiers. In: ACM International Conference Proceeding Series, p 247

  51. Smith LI (2002) A tutorial on principal components analysis. Cornell University, USA, 51, 52

  52. Smola AJ, Schölkopf B (2004) A tutorial on support vector regression. Stat Comput 14(3):199–222

    Article  MathSciNet  Google Scholar 

  53. Statista (2016) Leading social networks worldwide as of April 2016, ranked by number of active users (in millions). http://www.statista.com/statistics/272014/global-social-networks-ranked-by-number-of-users/

  54. Twitter (2016) The twitter rules. Retrieved from 28 Jan 2016, https://support.twitter.com/articles/18311

  55. URIBL (2015) URIBL API. Retrieved from 25 Nov 2015, http://uribl.com/

  56. Viswanath B, Bashir MA, Crovella M, Guha S, Gummadi KP, Krishnamurthy B, Mislove A (2014) Towards detecting anomalous user behavior in online social networks. In: Proceedings of the 23rd USENIX Security Symposium (USENIX Security)

  57. Vorakitphan V, Leu F-Y, Fan Y-C (2018) Clickbait detection based on word embedding models. In: International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing

  58. WEKA (2016) The University of Waikato. Retrieved from 2 Feb 2016, http://www.cs.waikato.ac.nz/ml/weka/

  59. Wikipedia (2016) Determining the number of clusters in a data set. Retrieved from 24 Jan 2016, https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set

  60. Yang Z, Xue J, Yang X, Wang X, Dai Y (2015) VoteTrust: leveraging friend invitation graph to defend against social network Sybils. IEEE Trans Dependable Secure Comput. https://doi.org/10.1109/TDSC.2015.2410792

    Article  Google Scholar 

  61. Yi X, Zhang Y (2013) Equally contributory privacy-preserving k-means clustering over vertically partitioned data. Inf Syst 38(1):97–107

    Article  Google Scholar 

  62. Yoon JW, Kim H, Huh JH (2010) Hybrid spam filtering for mobile communication. Comput Secur 29(4):446–459. https://doi.org/10.1016/j.cose.2009.11.003

    Article  Google Scholar 

  63. Zhang X, Zhu S, Liang W (2012) Detecting spam and promoting campaigns in the Twitter social network. In: 2012 IEEE 12th International Conference on Data Mining

  64. Zheng X, Zeng Z, Chen Z, Yu Y, Rong C (2015) Detecting spammers on social networks. Neurocomputing 159:27–34. https://doi.org/10.1016/j.neucom.2015.02.047

    Article  Google Scholar 

Download references

Acknowledgements

This work is funded by the Nigerian Tertiary Education Trust Fund (TETFund).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Arun Kumar Sangaiah.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest.

Ethical approval

This research does not involve human or animal; however, it requires data collection from Twitter social network with privacy policy. This research fully complied with the privacy policy of Twitter by following the Twitter approved procedures for data collection using OAuth authentication. In addition, we do not release the data collected from Twitter to any researcher. The identity of the individual accounts in the data is anonymized.

Informed consent

Informed consent was obtained from all individual participants included in the study.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Adewole, K.S., Han, T., Wu, W. et al. Twitter spam account detection based on clustering and classification methods. J Supercomput 76, 4802–4837 (2020). https://doi.org/10.1007/s11227-018-2641-x

Download citation

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11227-018-2641-x

Keywords

Profiles

  1. Kayode Sakariyah Adewole
  2. Arun Kumar Sangaiah