Abstract
Unauthorized e-mail messages that advertise services or products are called spam e-mails. Detecting spam e-mail remains a challenging task. Existing countermeasures based on statistical keyword, conceptual, and IP address-based blacklists are inefficient because they struggle to identify new attack patterns generated by Internet of Things botnet devices. Other spam detection approaches rely on a hybrid of conceptual knowledge engineering and machine learning techniques. However, modern spammers evade these hybrid techniques by exploiting word polysemy and ambiguity, which arise from the context-sensitive nature of words. In this paper, we propose integrating Naïve Bayesian classification with a conceptual and semantic similarity technique to combat the ambiguity that polysemy introduces in spam detection. To analyse the effectiveness of our approach, experiments were conducted on the benchmark data sets Spambase, PU1, Enron corpus, and Ling-spam. The experimental results show that the proposed system achieves an accuracy of 98.89%, which is higher than that of the existing approaches.
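To make the general idea concrete, the following is a minimal sketch of a multinomial Naïve Bayes spam classifier in which words are first collapsed onto shared concept tokens. It is not the authors' system: the `CONCEPTS` dictionary, the function names, and the four training messages are hypothetical and chosen only for illustration. The sketch shows how concept-level features let a classifier generalize across different surface words; it does not perform the context-dependent sense disambiguation that polysemy requires.

```python
import math
from collections import Counter

# Hypothetical concept map (an assumption for illustration, not from the paper):
# surface words with a similar meaning are collapsed onto one concept token.
CONCEPTS = {
    "cheap": "low_price", "discount": "low_price", "bargain": "low_price",
    "purchase": "buy",
}

def tokenize(text):
    """Lowercase, split on whitespace, and map each word to its concept if known."""
    return [CONCEPTS.get(word, word) for word in text.lower().split()]

def train(samples):
    """Count concept tokens per class and messages per class."""
    token_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for text, label in samples:
        class_counts[label] += 1
        token_counts[label].update(tokenize(text))
    return token_counts, class_counts

def classify(text, token_counts, class_counts):
    """Multinomial Naive Bayes with add-one (Laplace) smoothing, in log space."""
    vocab = set(token_counts["spam"]) | set(token_counts["ham"])
    total_docs = sum(class_counts.values())
    scores = {}
    for label in token_counts:
        total_tokens = sum(token_counts[label].values())
        # Start from the log prior of the class.
        score = math.log(class_counts[label] / total_docs)
        for tok in tokenize(text):
            # Smoothed log likelihood; unseen tokens contribute a count of 1.
            score += math.log((token_counts[label][tok] + 1)
                              / (total_tokens + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Toy training data, invented for this sketch.
TRAIN = [
    ("cheap pills buy now", "spam"),
    ("discount offer buy today", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the project report", "ham"),
]
model = train(TRAIN)
print(classify("bargain pills purchase", *model))
```

In this toy setup, "bargain" and "purchase" never appear in the training messages, yet they still count as evidence for spam because they map to the concepts `low_price` and `buy`. A realistic system would presumably replace the hand-written dictionary with a lexical resource or a learned similarity measure.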
Acknowledgements
This work was partially supported by the Ministry of Human Resource and Development (MHRD), New Delhi, India.
About this article
Cite this article
Venkatraman, S., Surendiran, B. & Arun Raj Kumar, P. Spam e-mail classification for the Internet of Things environment using semantic similarity approach. J Supercomput 76, 756–776 (2020). https://doi.org/10.1007/s11227-019-02913-7
DOI: https://doi.org/10.1007/s11227-019-02913-7