Abstract
This paper addresses content-based email classification and filtering, specifically spam versus ham and phishing versus spam classification. We test the validity of several novel statistical feature extraction methods, which rely on dimensionality reduction to retain the most informative and discriminative features. We successfully test our methods under two schemas. The first is a classic classification scenario using 10-fold cross-validation on several corpora: four standard ground-truth corpora (Ling-Spam, SpamAssassin, PU1, and a subset of the TREC 2007 spam corpus) and one proprietary corpus. In the second schema, we test the anticipatory properties of our extracted features and classification models on two proprietary datasets, formed by phishing and spam emails sorted by date, and on the public TREC 2007 spam corpus. The contributions of our work are an exhaustive comparison of several feature selection and extraction methods for email classification on different benchmark corpora, and evidence that biased discriminant analysis in particular yields more discriminative features for classification, gives stable classification results regardless of the number of features chosen, and produces features that robustly retain their discriminative value over time and across data setups. These findings are especially useful in a commercial setting, where short profile rules for filtering emails are built from a limited number of features.
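The two evaluation schemas can be illustrated with a minimal sketch of how the data might be partitioned. This is not the authors' code: the function names, the interleaved fold assignment, the ISO-formatted date strings, and the `train_fraction` parameter are all assumptions made for illustration, since the abstract does not specify how the folds or the date-based split were constructed.

```python
from typing import List, Sequence, Tuple


def k_fold_splits(n_items: int, k: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) pairs covering range(n_items)."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and n_items")
    indices = list(range(n_items))
    # Fold i takes every k-th item starting at offset i,
    # so fold sizes differ by at most one (an illustrative choice).
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        # Training set: every index that is not in the held-out fold.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((sorted(train), test))
    return splits


def chronological_split(
    dates: Sequence[str], train_fraction: float = 0.5
) -> Tuple[List[int], List[int]]:
    """Train on the oldest messages and test on the newest ones.

    Assumes ISO dates (YYYY-MM-DD), which sort correctly as plain strings.
    The default train_fraction is hypothetical, not taken from the paper.
    """
    order = sorted(range(len(dates)), key=lambda i: dates[i])
    cut = int(len(order) * train_fraction)
    return order[:cut], order[cut:]


if __name__ == "__main__":
    # First schema: each message is held out exactly once.
    for train_idx, test_idx in k_fold_splits(25, k=10):
        print(len(train_idx), len(test_idx))

    # Second schema: the model only ever sees messages older than the test set.
    sample_dates = ["2007-05-01", "2007-04-01", "2007-07-01", "2007-06-01"]
    print(chronological_split(sample_dates))
```

In the cross-validation setting every message serves as test data once, whereas the chronological split never trains on messages newer than those it is tested on, which is what makes it a test of how well features hold up over time.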
Cite this article
Gomez, J.C., Boiy, E. & Moens, MF. Highly discriminative statistical features for email classification. Knowl Inf Syst 31, 23–53 (2012). https://doi.org/10.1007/s10115-011-0403-7
DOI: https://doi.org/10.1007/s10115-011-0403-7