Highly discriminative statistical features for email classification

Gomez, Juan Carlos; Boiy, Erik; Moens, Marie-Francine

doi:10.1007/s10115-011-0403-7

Regular Paper
Published: 18 May 2011

Knowledge and Information Systems, Volume 31, pages 23–53 (2012)

Juan Carlos Gomez¹,
Erik Boiy¹ &
Marie-Francine Moens¹

Abstract

This paper reports on email classification and filtering, specifically spam versus ham and phishing versus spam classification, based on content features. We test the validity of several novel statistical feature extraction methods, which rely on dimensionality reduction to retain the most informative and discriminative features. We successfully test our methods under two schemas. The first is a classic classification scenario using 10-fold cross-validation on several corpora: four standard ground-truth corpora (Ling-Spam, SpamAssassin, PU1, and a subset of the TREC 2007 spam corpus) and one proprietary corpus. In the second schema, we test the anticipatory properties of our extracted features and classification models on two proprietary datasets, consisting of phishing and spam emails sorted by date, and on the public TREC 2007 spam corpus. The contributions of our work are an exhaustive comparison of several feature selection and extraction methods for email classification on different benchmark corpora, and evidence that biased discriminant analysis in particular yields more discriminative features for classification, gives stable classification results regardless of the number of features chosen, and produces features that robustly retain their discriminative value over time and across data setups. These findings are especially useful in a commercial setting, where short profile rules are built from a limited number of features for filtering emails.
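
The two evaluation schemas described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names `k_fold_indices` and `chronological_split`, the toy data, and the 50/50 `train_fraction` are all assumptions made for illustration, and the abstract does not say how the date-sorted data were divided. A real experiment would typically also shuffle or stratify the folds.

```python
from datetime import date


def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists for k-fold cross-validation.

    Item i is assigned to fold i % k, so fold sizes differ by at most one.
    """
    for fold in range(k):
        test = [i for i in range(n_items) if i % k == fold]
        train = [i for i in range(n_items) if i % k != fold]
        yield train, test


def chronological_split(dated_items, train_fraction=0.5):
    """Train on the oldest messages and test on the newer ones.

    dated_items: list of (date, message) pairs in any order.
    train_fraction is an illustrative choice, not a value from the paper.
    """
    ordered = sorted(dated_items, key=lambda pair: pair[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Hypothetical toy data: 20 messages, one per day.
emails = [(date(2007, 4, day), f"message {day}") for day in range(1, 21)]

# Schema 1: every message is used for testing exactly once.
folds = list(k_fold_indices(len(emails), k=10))

# Schema 2: the model only ever sees messages older than the test set.
older, newer = chronological_split(emails, train_fraction=0.5)
```

In the first schema each message is tested exactly once, whereas in the second every test message is newer than all training messages, which is what makes it a check of how well features hold up over time.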

Author information

Authors and Affiliations

Department of Computer Science, Katholieke Universiteit Leuven, Leuven, Belgium
Juan Carlos Gomez, Erik Boiy & Marie-Francine Moens

Corresponding author

Correspondence to Juan Carlos Gomez.

About this article

Cite this article

Gomez, J.C., Boiy, E. & Moens, MF. Highly discriminative statistical features for email classification. Knowl Inf Syst 31, 23–53 (2012). https://doi.org/10.1007/s10115-011-0403-7

Received: 01 February 2010
Revised: 26 January 2011
Accepted: 24 February 2011
Published: 18 May 2011
Issue Date: April 2012
DOI: https://doi.org/10.1007/s10115-011-0403-7
