Spam e-mail classification for the Internet of Things environment using semantic similarity approach

The Journal of Supercomputing

Abstract

Unauthorized messages that advertise services or products via electronic mail are called spam e-mails. Detecting spam e-mail remains a challenging task. Existing countermeasures based on statistical keyword, conceptual, and IP address-based blacklists are not efficient because they have difficulty finding new attack patterns generated by Internet of Things botnet devices. Other spam detection approaches rely on a hybrid of conceptual knowledge engineering and machine learning techniques. However, modern spammers evade these hybrid techniques by exploiting word polysemy and ambiguity, which arise from the context-sensitive nature of words. In this paper, we propose integrating Naïve Bayesian classification with conceptual and semantic similarity techniques to combat the ambiguity raised through polysemy in spam detection. To analyse the effectiveness of our approach, experiments were conducted on benchmark data sets such as Spambase, PU1, Enron corpus, and Ling-spam. The experimental results show that the proposed system achieves an accuracy of 98.89%, which is higher than that of the existing approaches.
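The general idea of pairing a Naïve Bayes classifier with a concept-level view of words can be illustrated with a minimal sketch. This is not the authors' system: the `CONCEPTS` table, the `to_concepts` helper, and the toy training messages are all hypothetical, and a fixed word-to-concept lookup merely stands in for a real semantic similarity step. The sketch only shows how mapping related words to a shared concept lets a multinomial Naïve Bayes model score words it never saw during training.

```python
import math
from collections import Counter

# Hypothetical concept map (assumption, not from the paper): related surface
# words are collapsed to one shared concept label before counting.
CONCEPTS = {
    "cheap": "LOW_PRICE", "discount": "LOW_PRICE", "bargain": "LOW_PRICE",
    "buy": "PURCHASE", "order": "PURCHASE", "purchase": "PURCHASE",
}

def to_concepts(text):
    """Lowercase, split on whitespace, and map known words to concept labels."""
    return [CONCEPTS.get(tok, tok) for tok in text.lower().split()]

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.token_counts = {c: Counter() for c in self.class_counts}
        for text, label in zip(texts, labels):
            self.token_counts[label].update(to_concepts(text))
        self.vocab = {t for counts in self.token_counts.values() for t in counts}
        return self

    def log_scores(self, text):
        """Return log prior plus summed log likelihoods for each class."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for c, n_docs in self.class_counts.items():
            total_tokens = sum(self.token_counts[c].values())
            score = math.log(n_docs / total_docs)
            for tok in to_concepts(text):
                count = self.token_counts[c][tok]
                score += math.log((count + 1) / (total_tokens + len(self.vocab)))
            scores[c] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)

# Toy training data, invented for illustration only.
train_texts = [
    "cheap pills buy now",
    "discount offer order today",
    "meeting agenda for monday",
    "project report attached",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayes().fit(train_texts, train_labels)
prediction = model.predict("bargain purchase")
```

Neither "bargain" nor "purchase" occurs in the toy training messages, yet both map to concepts that the spam class has already seen, so the message is scored as spam. A word-level model would treat both tokens as unknown and fall back on the class priors alone.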

Acknowledgements

This work was partially supported by the Ministry of Human Resource and Development (MHRD), New Delhi, India.

Author information

Corresponding author

Correspondence to S. Venkatraman.

Cite this article

Venkatraman, S., Surendiran, B. & Arun Raj Kumar, P. Spam e-mail classification for the Internet of Things environment using semantic similarity approach. J Supercomput 76, 756–776 (2020). https://doi.org/10.1007/s11227-019-02913-7

  • DOI: https://doi.org/10.1007/s11227-019-02913-7
