Enhancement of spam detection mechanism based on hybrid \(k\)-mean clustering and support vector machine

  • Methodologies and Application
  • Soft Computing

Abstract

Spam e-mail is widely regarded as a serious violation of privacy and has become a costly, unwanted form of communication. The support vector machine (SVM) has been widely used for e-mail spam classification, yet many studies have shown that handling very large amounts of data lowers its accuracy and increases its running time. This paper proposes a hybrid approach to e-mail spam classification that combines the SVM with \(k\)-mean clustering. The approach was evaluated experimentally on the standard spambase dataset to assess its feasibility. The combination improved the SVM and accordingly increased the accuracy of spam classification: the SVM algorithm alone reaches an accuracy of 96.30%, whereas the proposed hybrid of SVM and \(k\)-mean clustering reaches 98.01%. In addition, the experimental results on the spambase dataset show that the improved SVM (ESVM) significantly outperforms the SVM and many other recent spam classification methods.
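The abstract does not say how the clustering step and the SVM are coupled, so the following is only a minimal sketch under one assumed coupling: each class is compressed to k centroids with k-means, and a linear SVM is then trained on those centroids instead of on every e-mail. The synthetic two-dimensional data, the Pegasos-style trainer, and all function names (`make_blobs`, `kmeans`, `train_linear_svm`, `hybrid_train`, `run_demo`) are illustrative assumptions, not the authors' implementation, and the sketch does not use spambase or reproduce the reported accuracies.

```python
import random

def make_blobs(n_per_class, rng):
    """Toy stand-in for e-mail feature vectors: +1 = spam, -1 = legitimate."""
    data = []
    for label, center in ((1, (2.0, 2.0)), (-1, (-2.0, -2.0))):
        for _ in range(n_per_class):
            data.append((tuple(rng.gauss(c, 1.0) for c in center), label))
    rng.shuffle(data)
    return data

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(points, k, rng, iters=20):
    """Lloyd's algorithm; returns k centroids as tuples."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Mean of each cluster; an empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
    return centroids

def train_linear_svm(samples, rng, lam=0.01, steps=2000):
    """Pegasos-style stochastic subgradient descent on the hinge loss.
    The bias is learned as the weight of a constant 1.0 feature."""
    w = [0.0] * (len(samples[0][0]) + 1)
    for t in range(1, steps + 1):
        x, y = rng.choice(samples)
        x = x + (1.0,)
        eta = 1.0 / (lam * t)
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        w = [(1.0 - eta * lam) * wi for wi in w]
        if margin < 1.0:
            w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

def predict(w, x):
    score = sum(wi * xi for wi, xi in zip(w, x + (1.0,)))
    return 1 if score >= 0 else -1

def hybrid_train(train, k, rng):
    """ASSUMED coupling: compress each class to k centroids, then fit the SVM on them."""
    reduced = []
    for label in (1, -1):
        class_points = [x for x, y in train if y == label]
        reduced += [(c, label) for c in kmeans(class_points, k, rng)]
    return train_linear_svm(reduced, rng), reduced

def accuracy(w, data):
    return sum(predict(w, x) == y for x, y in data) / len(data)

def run_demo(seed=0, k=3):
    rng = random.Random(seed)
    train, test = make_blobs(200, rng), make_blobs(100, rng)
    w, reduced = hybrid_train(train, k, rng)
    return accuracy(w, test), len(train), len(reduced)

if __name__ == "__main__":
    acc, n_train, n_reduced = run_demo()
    print(f"trained on {n_reduced} centroids instead of {n_train} points; test accuracy {acc:.3f}")
```

Training on centroids targets the cost problem the abstract raises for large datasets, since the SVM sees 2k points rather than the full training set. Other couplings, such as using cluster membership to filter noisy training examples, would fit the abstract's description equally well.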

Acknowledgments

This work was financially supported in part by IDF at Universiti Teknologi Malaysia. The authors would like to thank the Research Management Centre (RMC), Universiti Teknologi Malaysia, and Algraf Technical College for their support.

Author information

Corresponding author

Correspondence to Nadir Omer Fadl Elssied.

Additional information

Communicated by V. Loia.

Cite this article

Elssied, N.O.F., Ibrahim, O. & Osman, A.H. Enhancement of spam detection mechanism based on hybrid \(k\)-mean clustering and support vector machine. Soft Comput 19, 3237–3248 (2015). https://doi.org/10.1007/s00500-014-1479-2

  • DOI: https://doi.org/10.1007/s00500-014-1479-2
