A modified content-based evolutionary approach to identify unsolicited emails

Trivedi, Shrawan Kumar; Dey, Shubhamoy

doi:10.1007/s10115-018-1271-1

Regular Paper
Published: 10 September 2018

Volume 60, pages 1427–1451, (2019)

Knowledge and Information Systems

Shrawan Kumar Trivedi¹ &
Shubhamoy Dey²

Abstract

This computational research seeks to classify unsolicited versus legitimate emails. A modified version of an existing genetic programming (GP) classifier, referred to as modified genetic programming (MGP), is implemented to build an ensemble of classifiers to identify unsolicited emails. The proposed classifier is assessed using informative features extracted from two corpora (Enron and SpamAssassin) with the help of the greedy stepwise feature search method. A comparative study is then performed with other popular classifiers, such as Bayesian network, naïve Bayes, decision tree, random forest (RF), support vector machine (SVM), and GP. The results are validated with 20-fold cross-validation and a paired t-test. They indicate that the proposed classifier performs better in terms of accuracy and false-positive detection than the other machine learning classifiers tested in this study. Using different training and testing sets of email files from the Enron corpus, ensemble-based classifiers, such as boosted SVM, boosted Bayesian, boosted naïve Bayes, RF, and the proposed MGP classifier, are also tested and compared on all metrics, including training and testing time. The findings suggest that the MGP classifier with the greedy stepwise feature search method offers an improvement over alternative methods in detecting unsolicited emails.
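The paired t-test mentioned above compares two classifiers on the same cross-validation folds by testing whether the mean per-fold difference in a metric is distinguishable from zero. The following is a minimal sketch of that statistic, not the authors' implementation: the function name `paired_t_statistic` is hypothetical, and the accuracy values are made-up placeholders rather than results from the paper.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples.

    scores_a and scores_b hold one score per fold, measured on the
    same folds for two different classifiers.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    n = len(scores_a)
    if n < 2:
        raise ValueError("need at least two pairs")

    # Work on the per-fold differences, since the samples are paired.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")

    # t = mean difference divided by its standard error.
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical per-fold accuracies for two classifiers (illustrative only).
folds_a = [0.95, 0.96, 0.94, 0.97, 0.95]
folds_b = [0.93, 0.95, 0.93, 0.94, 0.94]
t_value, dof = paired_t_statistic(folds_a, folds_b)
print(f"t = {t_value:.3f} with {dof} degrees of freedom")
```

The resulting t value would be compared against a Student's t distribution with n - 1 degrees of freedom (19 for 20 folds) to obtain a p-value.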

Author information

Authors and Affiliations

Indian Institute of Management Sirmaur, Sirmaur, HP, India
Shrawan Kumar Trivedi
Indian Institute of Management Indore, Indore, MP, India
Shubhamoy Dey

Corresponding author

Correspondence to Shrawan Kumar Trivedi.

About this article

Cite this article

Trivedi, S.K., Dey, S. A modified content-based evolutionary approach to identify unsolicited emails. Knowl Inf Syst 60, 1427–1451 (2019). https://doi.org/10.1007/s10115-018-1271-1

Received: 12 May 2017
Revised: 07 December 2017
Accepted: 26 May 2018
Published: 10 September 2018
Issue Date: 01 September 2019
DOI: https://doi.org/10.1007/s10115-018-1271-1
