Visual and textual features based email spam classification using S-Cuckoo search and hybrid kernel support vector machine

Cluster Computing

Abstract

Spam mail classification has become increasingly important with the rapid, uncontrolled growth of electronic media. The literature presents several classifier-based algorithms for email spam classification. In this paper, we propose a spam classification framework using S-Cuckoo search and a hybrid kernel based support vector machine (HKSVM). First, features are extracted from both the text and the images of the e-mails. For the textual features, term frequency (TF) is used. For the image-dependent features, the correlogram and wavelet moments are used. Because the combined feature set is high-dimensional, the optimal features are selected with a hybrid algorithm called S-Cuckoo search. Classification is then performed by the proposed HKSVM model, an SVM classifier whose hybrid kernel blends three different kernel functions. The additional image-based features and the modified SVM classifier provide a significant improvement over existing algorithms. Classification performance is measured on db1 (combining the bare Ling-Spam and Spam Archive corpora) and db2 (combining the lemm Ling-Spam and Spam Archive corpora). Experimental results show that the proposed framework achieves an accuracy of 97.235%, compared with 94.117% for the existing approach.
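To make the two ends of the pipeline concrete, the following minimal Python sketch shows a term-frequency feature extractor and a kernel built by blending three kernel functions. The abstract does not state which three kernels the HKSVM combines or how they are weighted, so the choice of linear, polynomial and RBF kernels, the equal weights, and all function names here are illustrative assumptions rather than the authors' method. The image features and the S-Cuckoo selection step are omitted.

```python
import math
from collections import Counter


def term_frequency(text, vocabulary):
    """Relative frequency of each vocabulary term in a whitespace-tokenised text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [counts[term] / total for term in vocabulary]


def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    return (linear_kernel(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def hybrid_kernel(x, y, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of three kernels (assumed blend, not taken from the paper).

    A sum of valid kernels with non-negative weights is itself a valid kernel.
    """
    w_lin, w_poly, w_rbf = weights
    return (
        w_lin * linear_kernel(x, y)
        + w_poly * polynomial_kernel(x, y)
        + w_rbf * rbf_kernel(x, y)
    )


def gram_matrix(vectors, kernel=hybrid_kernel):
    """Pairwise kernel values, the form an SVM solver with a precomputed kernel expects."""
    return [[kernel(a, b) for b in vectors] for a in vectors]


# Hypothetical vocabulary and messages, for illustration only.
vocabulary = ["free", "money", "meeting"]
emails = ["free money free", "meeting at noon"]
features = [term_frequency(e, vocabulary) for e in emails]
K = gram_matrix(features)
```

The resulting matrix `K` could be handed to any SVM implementation that accepts a precomputed kernel. In the framework described above, the feature vectors would first be extended with the image features and reduced by the feature selection step.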

Author information

Corresponding author

Correspondence to T. Kumaresan.

Cite this article

Kumaresan, T., Saravanakumar, S. & Balamurugan, R. Visual and textual features based email spam classification using S-Cuckoo search and hybrid kernel support vector machine. Cluster Comput 22 (Suppl 1), 33–46 (2019). https://doi.org/10.1007/s10586-017-1615-8

  • DOI: https://doi.org/10.1007/s10586-017-1615-8
