Abstract
Spam e-mails are considered a serious violation of privacy and have become a costly, unwanted form of communication. Support vector machines (SVMs) have been widely used in e-mail spam classification, yet many studies have shown that on very large data sets they suffer from reduced accuracy and high time consumption. This paper proposes a hybrid approach for e-mail spam classification based on SVM and \(k\)-mean clustering. The proposed approach was evaluated experimentally on the standard spambase dataset. Combining the two methods improved the SVM and accordingly increased the accuracy of spam classification: the SVM algorithm alone achieved 96.30 % accuracy, whereas the proposed hybrid of SVM and \(k\)-mean clustering achieved 98.01 %. In addition, the experimental results on the spambase dataset showed that the enhanced SVM (ESVM) significantly outperforms SVM and many other recent spam classification methods.
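As a rough reading of the two accuracies reported above, and assuming both were measured on the same spambase test split (an assumption; the abstract does not state the evaluation protocol), the gain can also be expressed in terms of error rates. The symbols \(e_{\text{SVM}}\) and \(e_{\text{hybrid}}\) are introduced here for illustration only:

\[
e_{\text{SVM}} = 100 - 96.30 = 3.70\,\%, \qquad e_{\text{hybrid}} = 100 - 98.01 = 1.99\,\%
\]
\[
\frac{e_{\text{SVM}} - e_{\text{hybrid}}}{e_{\text{SVM}}} = \frac{1.71}{3.70} \approx 0.46
\]

Under that assumption, the 1.71 percentage-point gain in accuracy would correspond to roughly 46 % fewer misclassified e-mails.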
Acknowledgments
This work was financially supported in part by IDF in Universiti Teknologi Malaysia. The authors would like to thank the Research Management Centre (RMC) Universiti Teknologi Malaysia and Algraf Technical College for their support.
Additional information
Communicated by V. Loia.
About this article
Cite this article
Elssied, N.O.F., Ibrahim, O. & Osman, A.H. Enhancement of spam detection mechanism based on hybrid \(k\)-mean clustering and support vector machine. Soft Comput 19, 3237–3248 (2015). https://doi.org/10.1007/s00500-014-1479-2
DOI: https://doi.org/10.1007/s00500-014-1479-2