Supervised classification of spam emails with natural language stylometry

  • Predictive Analytics Using Machine Learning
Neural Computing and Applications

Abstract

Email spam is one of the biggest threats to today’s Internet. Long-established countermeasures include supervised anti-spam filters. In this paper, we report the development and evaluation of sentinel, an anti-spam filter based on natural language and stylometry attributes. We evaluate the filter not only on non-personalized emails (i.e., emails collected randomly) but also on personalized emails (i.e., emails collected from particular individuals). Among the non-personalized datasets are CSDMC2010, SpamAssassin, and LingSpam, while the Enron-Spam collection comprises personalized emails. The proposed filter extracts from email text natural language attributes that are closely related to writer stylometry and generates classifiers using multiple learning algorithms. Experimental results show that classifiers generated by meta-learning algorithms such as AdaBoostM1 and bagging perform best: they perform equally well and surpass a number of filters proposed in previous studies, while a classifier generated by random forest is a close second. In contrast, classifiers generated with support vector machines and Naïve Bayes do not perform satisfactorily. In addition, we find much improved results on personalized emails and mixed results on non-personalized emails.
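To make the idea of stylometry attributes concrete, the following is a minimal sketch of how a few style-related values could be computed from an email body. The abstract does not list the filter's actual attributes, so the function name `stylometric_features` and every feature in it are illustrative assumptions, not the attribute set used by sentinel.

```python
import re
from typing import Dict


def stylometric_features(text: str) -> Dict[str, float]:
    """Compute a few simple, illustrative style attributes for one email body.

    These features are assumptions chosen for illustration; they are not
    the attributes used by the filter described in the abstract.
    """
    # Treat runs of letters and apostrophes as words.
    words = re.findall(r"[A-Za-z']+", text)
    # Split on runs of sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    letters = [c for c in text if c.isalpha()]

    n_words = len(words)
    n_sentences = len(sentences)
    n_letters = len(letters)

    return {
        "word_count": float(n_words),
        "sentence_count": float(n_sentences),
        # Mean number of characters per word.
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        # Mean number of words per sentence.
        "avg_sentence_length": n_words / n_sentences if n_sentences else 0.0,
        # Share of alphabetic characters that are uppercase.
        "uppercase_ratio": sum(c.isupper() for c in letters) / n_letters if n_letters else 0.0,
        "exclamation_count": float(text.count("!")),
        # Distinct lowercased words divided by total words (vocabulary richness).
        "type_token_ratio": len({w.lower() for w in words}) / n_words if n_words else 0.0,
    }


if __name__ == "__main__":
    sample = "WIN a FREE prize now! Click here. Offer ends soon!"
    for name, value in stylometric_features(sample).items():
        print(f"{name}: {value:.3f}")
```

Each email would yield one such feature vector, and the vectors, paired with spam or legitimate labels, could then be passed to any supervised learner, for example a bagging or boosting implementation from an existing machine learning toolkit.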

Notes

  1. Most of the public email datasets are imbalanced [19].

  2. Downloadable at http://nlp.stanford.edu/software/tagger.shtml.

  3. Available at: http://www.languagetool.org/java-api/.

  4. Downloadable at http://jsoup.org/download.

  5. http://cran.r-project.org/web/packages/Boruta/index.html.

  6. Downloadable at http://spamassassin.apache.org/publiccorpus/.

  7. Downloadable at http://csmining.org/index.php/spam-email-datasets-.html.

  8. Downloadable at http://csmining.org/index.php/ling-spam-datasets.html.

  9. Downloadable at https://labs-repos.iit.demokritos.gr/skel/i-config/downloads/enron-spam.

  10. See http://www.projecthoneypot.org.

  11. Overview at http://untroubled.org/spam.

References

  1. Abi-Haidar A, Rocha LM (2008a) Adaptive spam detection inspired by a cross-regulation model of immune dynamics: a study of concept drift. In: Artificial immune systems. Springer, Berlin, pp 36–47

  2. Abi-Haidar A, Rocha LM (2008b) Adaptive spam detection inspired by the immune system. In: ALIFE, pp 1–8

  3. Afroz S, Brennan M, Greenstadt R (2012) Detecting hoaxes, frauds, and deception in writing style online. In: 2012 IEEE symposium on security and privacy (SP), pp 461–475

  4. Androutsopoulos I, Koutsias J, Chandrinos KV, Spyropoulos CD (2000) An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages. In: 23rd Annual international ACM SIGIR conference on research and development in information retrieval. ACM, pp 160–167

  5. Bickel S (2006) ECML-PKDD discovery challenge 2006 overview. In: Proceedings of the ECML/PKDD discovery challenge workshop, pp 1–9

  6. Blanzieri E, Bryl A (2008) A survey of learning-based techniques of email spam filtering. Artif Intell Rev 29(1):63–92

  7. Bratko A, Cormack GV, R D, Filipic B, Chan P, Lynam TR (2006) Spam filtering using statistical data compression models. J Mach Learn Res 7:2673–2698

  8. Breiman L (1996) Bagging predictors. Mach Learn 24:123–140

  9. Carreras X, Màrquez L (2001) Boosting trees for anti-spam email filtering. In: RANLP-2001, 4th International conference on recent advances in natural language processing, pp 58–64

  10. Cheng V, Li C (2007) Combining supervised and semi-supervised classifier for personalized spam filtering. In: Proceedings of the 11th Pacific-Asia conference on knowledge discovery and data mining (PAKDD 2007), pp 449–456. doi:10.1007/978-3-540-71701-0_45

  11. Cheng V, Li CH (2006) Personalized spam filtering with semi-supervised classifier ensemble. In: 2006 IEEE/WIC/ACM international conference on web intelligence (WI 2006), pp 195–201. doi:10.1109/WI.2006.132

  12. Commtouch (2013) Internet threats trend report. Technical report, Commtouch, USA. http://www.commtouch.com/uploads/2013/04/Commtouch-Internet-Threats-Trend-Report-2013-April.pdf

  13. Cormack GV (2007) TREC 2007 spam track overview. In: Proceedings of the sixteenth text retrieval conference, TREC 2007. http://trec.nist.gov/pubs/trec16/papers/SPAM.OVERVIEW16.pdf

  14. Cormack GV, Bratko A (2006) Batch and online spam filter comparison. In: Conference on email and anti-spam, CEAS 2006, Mountain View, CA

  15. Cormack GV, Lynam TR (2005) TREC 2005 spam track overview. In: Proceedings of the fourteenth text retrieval conference, TREC 2005. http://trec.nist.gov/pubs/trec14/papers/SPAM.OVERVIEW.pdf

  16. Drummond C, Holte R (2006) Cost curves: an improved method for visualizing classifier performance. Mach Learn 65(1):95–130

  17. Goodman J, Cormack GV, Heckerman D (2007) Spam and the ongoing battle for the inbox. Commun ACM 50(2):24–33

  18. Graham P (2003) A plan for spam. http://paulgraham.com/spam.html

  19. Guzella TS, Caminhas WM (2009) A review of machine learning approaches to spam filtering. Expert Syst Appl 36(7):10206–10222

  20. Haider P, Brefeld U, Scheffer T (2007) Supervised clustering of streaming data for email batch detection. In: 24th International conference on machine learning. ACM, pp 345–352

  21. Hastie T, Tibshirani R, Friedman J (2001) The elements of statistical learning. Springer Series in Statistics. Springer, Berlin

  22. Holte RC, Drummond C (2008) Cost-sensitive classifier evaluation using cost curves. Lecture Notes in Computer Science. In: Washio T, Suzuki E, Ting KM, Inokuchi A (eds) Pacific-Asia conference on knowledge discovery and data mining (PAKDD), vol 5012. Springer, Berlin, pp 26–29

  23. Hu Y, Guo C, Ngai EWT, Liu M, Chen S (2010) A scalable intelligent non-content-based spam-filtering framework. Expert Syst Appl 37(12):8557–8565

  24. Iqbal F, Khan LA, Fung BCM, Debbabi M (2010) E-mail authorship verification for forensic investigation. In: Proceedings of the 2010 ACM symposium on applied computing, ACM, New York, NY, SAC ’10, pp 1591–1598

  25. Issac B, Jap WJ, Sutanto JH (2009) Improved Bayesian anti-spam filter implementation and analysis on independent spam corpuses. In: 2009 International conference on computer engineering and technology, vol 02. IEEE Computer Society, pp 326–330

  26. Kosmopoulos A, Paliouras G, Androutsopoulos I (2008) Adaptive spam filtering using only naive Bayes text classifiers. In: Fifth conference on email and anti-spam (CEAS 2008)

  27. Kursa MB, Rudnicki WR (2010) Feature selection with the Boruta package. J Stat Softw 36(11):1–13. http://www.jstatsoft.org/v36/i11/

  28. Lai CC, Tsai MC (2004) An empirical performance comparison of machine learning methods for spam e-mail categorization. In: Fourth international conference on hybrid intelligent systems. IEEE Computer Society, HIS ’04, pp 44–48

  29. Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2(3):18–22. http://CRAN.R-project.org/doc/Rnews/

  30. Ma Q, Qin Z, Zhang F, Liu Q (2010) Text spam neural network classification algorithm. In: 2010 International conference on communications, circuits and systems (ICCCAS), pp 466–469

  31. Meng Y, Li W, Kwok L (2014) Enhancing email classification using data reduction and disagreement-based semi-supervised learning. In: IEEE international conference on communications, ICC 2014, Sydney, Australia, pp 622–627. doi:10.1109/ICC.2014.6883388

  32. Metsis V, Androutsopoulos I, Paliouras G (2006) Spam filtering with naive Bayes—Which naive Bayes? In: Third conference on email and anti-spam (CEAS)

  33. Mojdeh M, Cormack GV (2008) Semi-supervised spam filtering: does it work? In: Proceedings of the 31st annual international ACM SIGIR conference on research and development in information retrieval, SIGIR 2008, pp 745–746. doi:10.1145/1390334.1390482

  34. Orăsan C, Krishnamurthy R (2002) A corpus-based investigation of junk emails. In: Third international conference on language resources and evaluation (LREC-2002), Spain, pp 1773–1780

  35. Prabhakar R, Basavaraju M (2010) A novel method of spam mail detection using text based clustering approach. Int J Comput Appl 5(4):15–25. Foundation of Computer Science

  36. Qaroush A, Khater IM, Washaha M (2012) Identifying spam e-mail based-on statistical header features and sender behavior. In: CUBE international information technology conference. ACM, pp 771–778

  37. Razmara M, Razmara A, Narouei M (2012) Textual spam detection: an iterative pattern mining approach. World Appl Sci J 20(2):198–204

  38. Sahami M, Dumais S, Heckerman D, Horvitz E (1998) A Bayesian approach to filtering junk e-mail. In: Learning for text categorization: papers from the 1998 workshop, AAAI Technical Report WS-98-05, pp 55–62

  39. Schapire RE (1999) A brief introduction to boosting. In: 16th international joint conference on Artificial intelligence, vol 2, Morgan Kaufmann Publishers Inc., Los Altos, CA, IJCAI’99, pp 1401–1406

  40. Shams R, Mercer RE (2013) Classifying spam emails using text and readability features. In: 2013 IEEE 13th international conference on data mining, pp 657–666. doi:10.1109/ICDM.2013.131

  41. Shen X, Tseng GC, Zhang X, Wong WH (2003) On psi-learning. J Am Stat Assoc 98:724–734. http://EconPapers.repec.org/RePEc:bes:jnlasa:v:98:y:2003:p:724-734

  42. Sheu JJ (2009) An efficient two-phase spam filtering method based on e-mails categorization. Int J Netw Secur 9(1):34–43

  43. Sirisanyalak B, Sornil O (2007) Artificial immunity-based feature extraction for spam detection. In: Eighth ACIS international conference on software engineering, artificial intelligence, networking, and parallel/distributed computing (SNPD 2007), vol 3, pp 359–364

  44. Vapnik V (1998) Statistical learning theory. Wiley, New York

  45. Wang J, Shen X (2007) Large margin semi-supervised learning. J Mach Learn Res 8:1867–1891. http://dl.acm.org/citation.cfm?id=1314561

  46. Xu JM, Fumera G, Roli F, Zhou ZH (2009) Training SpamAssassin with active semi-supervised learning. In: Sixth conference on email and anti-spam

  47. Yang J, Liu Y, Liu Z, Zhu X, Zhang X (2011) A new feature selection algorithm based on binomial hypothesis testing for spam filtering. Knowl Based Syst 24(6):904–914

  48. Ye M, Tao T, Mai FJ, Cheng XH (2008) A spam discrimination based on mail header feature and SVM. In: Fourth international conference on wireless communications, networking and mobile computing (WiCom08), pp 1–4

  49. Zhan J, Oommen BJ, Crisostomo J (2011) Anomaly detection in dynamic systems using weak estimators. ACM Trans Internet Technol 11(1):3:1–3:16

  50. Zhu Y, Tan Y (2011) A local-concentration-based feature extraction approach for spam filtering. IEEE Trans Inf Forensics Secur 6(2):486–497

Acknowledgments

Support for this work was provided through a Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant to Robert E. Mercer (Grant No. 36853–2010 RGPIN). We are indebted to Vangelis Metsis, Aris Kosmopoulos, and Robert Holte for their correspondence regarding the use of their term frequency attribute and Cost Curve Tool.

Author information

Corresponding author

Correspondence to Rushdi Shams.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Cite this article

Shams, R., Mercer, R.E. Supervised classification of spam emails with natural language stylometry. Neural Comput & Applic 27, 2315–2331 (2016). https://doi.org/10.1007/s00521-015-2069-7

  • DOI: https://doi.org/10.1007/s00521-015-2069-7
