Abstract
This paper reports on email filtering based on content features. We test the validity of a novel statistical feature extraction method that relies on dimensionality reduction to retain the most informative and discriminative features of messages. The approach, named Biased Discriminant Analysis (BDA), seeks a feature space transformation that tightly clusters the positive examples while pushing the negative ones away. The method is an extension of Linear Discriminant Analysis (LDA) but uses a different transformation to improve the separation between classes, and it has not previously been applied to text mining tasks.
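The idea of clustering positives while pushing negatives away can be sketched as follows. This is a minimal illustration, assuming the formulation of BDA commonly used in the relevance-feedback literature: maximise the scatter of the negative examples around the positive mean relative to the scatter of the positive examples around that same mean, solved as a regularised generalised eigenproblem. The function names, the `reg` parameter, and the whitening-based solver are assumptions made for this sketch and are not taken from the paper.

```python
import numpy as np


def bda_fit(X_pos, X_neg, n_components=1, reg=1e-3):
    """Return (W, m_pos): a (d, n_components) projection and the positive mean.

    Assumed objective: maximise |W^T S_neg W| / |W^T S_pos W|, where both
    scatter matrices are centred on the mean of the POSITIVE class.
    """
    m_pos = X_pos.mean(axis=0)
    D_pos = X_pos - m_pos            # positives around their own mean
    D_neg = X_neg - m_pos            # negatives around the positive mean
    S_pos = D_pos.T @ D_pos
    S_neg = D_neg.T @ D_neg

    d = S_pos.shape[0]
    # Regularise so the positive scatter is invertible (hypothetical choice).
    S_pos_reg = S_pos + reg * np.eye(d)

    # Whitening: S_pos_reg = L L^T turns the generalised eigenproblem
    # S_neg w = lambda * S_pos_reg w into a symmetric standard one.
    L = np.linalg.cholesky(S_pos_reg)
    L_inv = np.linalg.inv(L)
    M = L_inv @ S_neg @ L_inv.T
    evals, evecs = np.linalg.eigh(M)             # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:n_components]
    W = L_inv.T @ evecs[:, top]                  # map back to the input space
    return W, m_pos


def bda_transform(X, W, m_pos):
    """Project feature vectors into the reduced BDA space."""
    return (X - m_pos) @ W
```

In an email setting, `X_pos` and `X_neg` would hold the term-weight vectors of the two message classes, and the projected features would then be passed to an ordinary classifier. Which class plays the "positive" role is a modelling decision that this sketch leaves open.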
We successfully test BDA under two schemas. The first is a traditional classification scenario using 10-fold cross-validation on four ground-truth standard corpora: LingSpam, SpamAssassin, the Phishing corpus, and a subset of the TREC 2007 spam corpus. In the second, we test the anticipatory properties of the statistical features on the TREC 2007 spam corpus.
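The two evaluation schemas can be illustrated with a short sketch. It assumes that messages are indexed in arrival order and that the anticipatory test trains on earlier messages and evaluates on later ones. The abstract does not give the split details, so `train_fraction`, `seed`, and both function names are hypothetical.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists for shuffled k-fold cross-validation."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]        # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def chronological_split(n_items, train_fraction=0.5):
    """Train on the earliest messages, test on the later ones (no shuffling)."""
    cut = int(n_items * train_fraction)
    return list(range(cut)), list(range(cut, n_items))
```

The first function matches the classification scenario, where every message is used for testing exactly once. The second keeps the time order intact, which is what lets a later test set show whether features learned earlier keep their discriminative value.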
The contribution of this work is evidence that BDA offers better discriminative features for email filtering, gives stable classification results regardless of the number of features chosen, and yields features that robustly retain their discriminative value over time.
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Gomez, J.C., Moens, M.-F. (2010). Using Biased Discriminant Analysis for Email Filtering. In: Setchi, R., Jordanov, I., Howlett, R.J., Jain, L.C. (eds) Knowledge-Based and Intelligent Information and Engineering Systems. KES 2010. Lecture Notes in Computer Science, vol 6276. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15387-7_60
DOI: https://doi.org/10.1007/978-3-642-15387-7_60
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15386-0
Online ISBN: 978-3-642-15387-7
eBook Packages: Computer Science, Computer Science (R0)