Supervised classification of spam emails with natural language stylometry

Shams, Rushdi; Mercer, Robert E.

doi:10.1007/s00521-015-2069-7

Predictive Analytics Using Machine Learning
Published: 03 November 2015

Volume 27, pages 2315–2331 (2016)

Neural Computing and Applications

Rushdi Shams¹ &
Robert E. Mercer¹

Abstract

Email spam is one of the biggest threats to today’s Internet. Long-established measures against it include supervised anti-spam filters. In this paper, we report the development and evaluation of Sentinel, an anti-spam filter based on natural language and stylometry attributes. The performance of the filter is evaluated not only on non-personalized emails (i.e., emails collected randomly) but also on personalized emails (i.e., emails collected from particular individuals). The non-personalized datasets are CSDMC2010, SpamAssassin, and LingSpam, while the Enron-Spam collection comprises personalized emails. The proposed filter extracts natural language attributes from email text that are closely related to writer stylometry and generates classifiers using multiple learning algorithms. Experimental outcomes show that classifiers generated by meta-learning algorithms such as AdaBoostM1 and bagging are the best, performing equally well and surpassing the performance of a number of filters proposed in previous studies, while a classifier generated by random forest is a close second. On the other hand, the performance of classifiers generated by support vector machine and Naïve Bayes is not satisfactory. In addition, we find much improved results on personalized emails and mixed results on non-personalized emails.
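As a rough illustration of what extracting stylometry-related attributes from email text could look like, the following minimal Python sketch computes a handful of simple surface measures. The function name `stylometric_features` and the six attributes it returns are assumptions chosen for illustration; they are not the attribute set used by the filter, which the abstract does not enumerate. The tokenisation and sentence splitting are deliberately naive.

```python
import re

# Naive tokenisers; assumptions for illustration only.
WORD_RE = re.compile(r"[A-Za-z']+")
SENTENCE_END_RE = re.compile(r"[.!?]+")


def stylometric_features(text: str) -> dict[str, float]:
    """Return a few hypothetical surface-level stylometric attributes."""
    words = WORD_RE.findall(text)
    # Split on runs of sentence-ending punctuation and drop empty pieces.
    sentences = [s for s in SENTENCE_END_RE.split(text) if s.strip()]
    n_words = len(words)
    n_sentences = max(len(sentences), 1)  # avoid division by zero
    distinct = {w.lower() for w in words}
    # Count fully upper-case words of at least two letters ("shouting").
    shouted = sum(1 for w in words if len(w) > 1 and w.isupper())

    return {
        "word_count": float(n_words),
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "avg_sentence_length": n_words / n_sentences,
        "type_token_ratio": len(distinct) / n_words if n_words else 0.0,
        "uppercase_word_ratio": shouted / n_words if n_words else 0.0,
        "exclamation_count": float(text.count("!")),
    }


if __name__ == "__main__":
    sample = "WIN CASH NOW! Click here now!"
    for name, value in stylometric_features(sample).items():
        print(f"{name}: {value:.3f}")
```

Each email would yield one such attribute vector, and a labelled collection of vectors could then be passed to any supervised learner (for example a bagging or boosting implementation) to generate a classifier.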

Notes

Most of the public email datasets are imbalanced [19].
Downloadable at http://nlp.stanford.edu/software/tagger.shtml.
Available at: http://www.languagetool.org/java-api/.
Downloadable at http://jsoup.org/download.
Available at http://cran.r-project.org/web/packages/Boruta/index.html.
Downloadable at http://spamassassin.apache.org/publiccorpus/.
Downloadable at http://csmining.org/index.php/spam-email-datasets-.html.
Downloadable at http://csmining.org/index.php/ling-spam-datasets.html.
Downloadable at https://labs-repos.iit.demokritos.gr/skel/i-config/downloads/enron-spam.
See http://www.projecthoneypot.org.
Overview at http://untroubled.org/spam.

References

Abi-Haidar A, Rocha LM (2008a) Adaptive spam detection inspired by a cross-regulation model of immune dynamics: a study of concept drift. In: Artificial immune systems. Springer, Berlin, pp 36–47
Abi-Haidar A, Rocha LM (2008b) Adaptive spam detection inspired by the immune system. In: ALIFE, pp 1–8
Afroz S, Brennan M, Greenstadt R (2012) Detecting hoaxes, frauds, and deception in writing style online. In: 2012 IEEE symposium on security and privacy (SP), pp 461–475
Androutsopoulos I, Koutsias J, Chandrinos KV, Spyropoulos CD (2000) An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages. In: 23rd Annual international ACM SIGIR conference on research and development in information retrieval. ACM, pp 160–167
Bickel S (2006) ECML-PKDD discovery challenge 2006 overview. In: Proceedings of the ECML/PKDD discovery challenge workshop, pp 1–9
Blanzieri E, Bryl A (2008) A survey of learning-based techniques of email spam filtering. Artif Intell Rev 29(1):63–92
Bratko A, Cormack GV, R D, Filipic B, Chan P, Lynam TR (2006) Spam filtering using statistical data compression models. J Mach Learn Res 7:2673–2698
Breiman L (1996) Bagging predictors. Mach Learn 24:123–140
Carreras X, Màrquez L (2001) Boosting trees for anti-spam email filtering. In: RANLP-2001, 4th International conference on recent advances in natural language processing, pp 58–64
Cheng V, Li C (2007) Combining supervised and semi-supervised classifier for personalized spam filtering. In: Proceedings of the 11th Pacific-Asia conference on knowledge discovery and data mining (PAKDD 2007), pp 449–456. doi:10.1007/978-3-540-71701-0_45
Cheng V, Li CH (2006) Personalized spam filtering with semi-supervised classifier ensemble. In: 2006 IEEE/WIC/ACM international conference on web intelligence (WI 2006), pp 195–201. doi:10.1109/WI.2006.132
Commtouch (2013) Internet threats trend report. Technical report, Commtouch, USA. http://www.commtouch.com/uploads/2013/04/Commtouch-Internet-Threats-Trend-Report-2013-April.pdf
Cormack GV (2007) TREC 2007 spam track overview. In: Proceedings of the sixteenth text retrieval conference, TREC 2007. http://trec.nist.gov/pubs/trec16/papers/SPAM.OVERVIEW16.pdf
Cormack GV, Bratko A (2006) Batch and online spam filter comparison. In: Conference on email and anti-spam, CEAS 2006, Mountain View, CA
Cormack GV, Lynam TR (2005) TREC 2005 spam track overview. In: Proceedings of the fourteenth text retrieval conference, TREC 2005. http://trec.nist.gov/pubs/trec14/papers/SPAM.OVERVIEW.pdf
Drummond C, Holte R (2006) Cost curves: an improved method for visualizing classifier performance. Mach Learn 65(1):95–130
Goodman J, Cormack GV, Heckerman D (2007) Spam and the ongoing battle for the inbox. Commun ACM 50(2):24–33
Graham P (2003) A plan for spam. http://paulgraham.com/spam.html
Guzella TS, Caminhas WM (2009) A review of machine learning approaches to spam filtering. Expert Syst Appl 36(7):10206–10222
Haider P, Brefeld U, Scheffer T (2007) Supervised clustering of streaming data for email batch detection. In: 24th International conference on machine learning. ACM, pp 345–352
Hastie T, Tibshirani R, Friedman J (2001) The elements of statistical learning. Springer Series in Statistics. Springer, Berlin
Holte RC, Drummond C (2008) Cost-sensitive classifier evaluation using cost curves. Lecture Notes in Computer Science. In: Washio T, Suzuki E, Ting KM, Inokuchi A (eds) Pacific-Asia conference on knowledge discovery and data mining (PAKDD), vol 5012. Springer, Berlin, pp 26–29
Hu Y, Guo C, Ngai EWT, Liu M, Chen S (2010) A scalable intelligent non-content-based spam-filtering framework. Expert Syst Appl 37(12):8557–8565
Iqbal F, Khan LA, Fung BCM, Debbabi M (2010) E-mail authorship verification for forensic investigation. In: Proceedings of the 2010 ACM symposium on applied computing, ACM, New York, NY, SAC ’10, pp 1591–1598
Issac B, Jap WJ, Sutanto JH (2009) Improved Bayesian anti-spam filter implementation and analysis on independent spam corpuses. In: 2009 International conference on computer engineering and technology, vol 02. IEEE Computer Society, pp 326–330
Kosmopoulos A, Paliouras G, Androutsopoulos I (2008) Adaptive spam filtering using only naive Bayes text classifiers. In: Fifth conference on email and anti-spam (CEAS 2008)
Kursa MB, Rudnicki WR (2010) Feature selection with the Boruta package. J Stat Softw 36(11):1–13. http://www.jstatsoft.org/v36/i11/
Lai CC, Tsai MC (2004) An empirical performance comparison of machine learning methods for spam e-mail categorization. In: Fourth international conference on hybrid intelligent systems. IEEE Computer Society, HIS ’04, pp 44–48
Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2(3):18–22. http://CRAN.R-project.org/doc/Rnews/
Ma Q, Qin Z, Zhang F, Liu Q (2010) Text spam neural network classification algorithm. In: 2010 International conference on communications, circuits and systems (ICCCAS), pp 466–469
Meng Y, Li W, Kwok L (2014) Enhancing email classification using data reduction and disagreement-based semi-supervised learning. In: IEEE international conference on communications, ICC 2014, Sydney, Australia, pp 622–627. doi:10.1109/ICC.2014.6883388
Metsis V, Androutsopoulos I, Paliouras G (2006) Spam filtering with naive Bayes—Which naive Bayes? In: Third conference on email and anti-spam (CEAS)
Mojdeh M, Cormack GV (2008) Semi-supervised spam filtering: does it work? In: Proceedings of the 31st annual international ACM SIGIR conference on research and development in information retrieval, SIGIR 2008, pp 745–746. doi:10.1145/1390334.1390482
Orăsan C, Krishnamurthy R (2002) A corpus-based investigation of junk emails. In: Third international conference on language resources and evaluation (LREC-2002), Spain, pp 1773–1780
Prabhakar R, Basavaraju M (2010) A novel method of spam mail detection using text based clustering approach. Int J Comput Appl 5(4):15–25. Published by Foundation of Computer Science
Qaroush A, Khater IM, Washaha M (2012) Identifying spam e-mail based-on statistical header features and sender behavior. In: CUBE international information technology conference. ACM, pp 771–778
Razmara M, Razmara A, Narouei M (2012) Textual spam detection: an iterative pattern mining approach. World Appl Sci J 20(2):198–204
Sahami M, Dumais S, Heckerman D, Horvitz E (1998) A Bayesian approach to filtering junk e-mail. In: Learning for text categorization: papers from the 1998 workshop, AAAI Technical Report WS-98-05, pp 55–62
Schapire RE (1999) A brief introduction to boosting. In: 16th international joint conference on Artificial intelligence, vol 2, Morgan Kaufmann Publishers Inc., Los Altos, CA, IJCAI’99, pp 1401–1406
Shams R, Mercer RE (2013) Classifying spam emails using text and readability features. In: 2013 IEEE 13th international conference on data mining, pp 657–666. doi:10.1109/ICDM.2013.131
Shen X, Tseng GC, Zhang X, Wong WH (2003) On psi-learning. J Am Stat Assoc 98:724–734. http://EconPapers.repec.org/RePEc:bes:jnlasa:v:98:y:2003:p:724-734
Sheu JJ (2009) An efficient two-phase spam filtering method based on e-mails categorization. Int J Netw Secur 9(1):34–43
Sirisanyalak B, Sornil O (2007) Artificial immunity-based feature extraction for spam detection. In: Eighth ACIS international conference on software engineering, artificial intelligence, networking, and parallel/distributed computing (SNPD 2007), vol 3, pp 359–364
Vapnik V (1998) Statistical learning theory. Wiley, New York
Wang J, Shen X (2007) Large margin semi-supervised learning. J Mach Learn Res 8:1867–1891. http://dl.acm.org/citation.cfm?id=1314561
Xu JM, Fumera G, Roli F, Zhou ZH (2009) Training SpamAssassin with active semi-supervised learning. In: Sixth conference on email and anti-spam
Yang J, Liu Y, Liu Z, Zhu X, Zhang X (2011) A new feature selection algorithm based on binomial hypothesis testing for spam filtering. Knowl Based Syst 24(6):904–914
Ye M, Tao T, Mai FJ, Cheng XH (2008) A spam discrimination based on mail header feature and SVM. In: Fourth international conference on wireless communications, networking and mobile computing (WiCom08), pp 1–4
Zhan J, Oommen BJ, Crisostomo J (2011) Anomaly detection in dynamic systems using weak estimators. ACM Trans Internet Technol 11(1):3:1–3:16
Zhu Y, Tan Y (2011) A local-concentration-based feature extraction approach for spam filtering. IEEE Trans Inf Forensics Secur 6(2):486–497

Acknowledgments

Support for this work was provided through a Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant to Robert E. Mercer (Grant No. 36853–2010 RGPIN). We are indebted to Vangelis Metsis, Aris Kosmopoulos, and Robert Holte for their correspondence regarding the use of their term frequency attribute and Cost Curve Tool.

Author information

Authors and Affiliations

Cognitive Engineering Laboratory, Department of Computer Science, The University of Western Ontario, London, ON, Canada
Rushdi Shams & Robert E. Mercer

Corresponding author

Correspondence to Rushdi Shams.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

About this article

Cite this article

Shams, R., Mercer, R.E. Supervised classification of spam emails with natural language stylometry. Neural Comput & Applic 27, 2315–2331 (2016). https://doi.org/10.1007/s00521-015-2069-7

Received: 12 March 2015
Accepted: 26 September 2015
Published: 03 November 2015
Issue Date: November 2016
DOI: https://doi.org/10.1007/s00521-015-2069-7
