Using Biased Discriminant Analysis for Email Filtering

Gomez, Juan Carlos; Moens, Marie-Francine

doi:10.1007/978-3-642-15387-7_60

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6276)

Included in the following conference series:

International Conference on Knowledge-Based and Intelligent Information and Engineering Systems

Abstract

This paper reports on email filtering based on content features. We test the validity of a novel statistical feature extraction method that relies on dimensionality reduction to retain the most informative and discriminative features from messages. The approach, named Biased Discriminant Analysis (BDA), aims to find a feature space transformation that clusters the positive examples closely together while pushing the negative ones away. The method is an extension of Linear Discriminant Analysis (LDA) but introduces a different transformation to improve the separation between classes; until now it has not been applied to text mining tasks.
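
The objective described above can be written out as a short derivation. This is a sketch of BDA as it is commonly formulated in the relevance-feedback literature, not necessarily the exact notation or regularization used in the chapter, which this preview does not show. The symbols x_i (positive examples), y_j (negative examples), m_x, S_x, S_y and W are assumptions introduced here for illustration.

```latex
% Positive examples x_1, ..., x_{N_x} with mean m_x;
% negative examples y_1, ..., y_{N_y}.
\begin{align}
m_x &= \frac{1}{N_x}\sum_{i=1}^{N_x} x_i \\
% Scatter of the positives around their own mean (to be made small)
S_x &= \sum_{i=1}^{N_x} (x_i - m_x)(x_i - m_x)^{\top} \\
% Scatter of the negatives around the POSITIVE mean (to be made large)
S_y &= \sum_{j=1}^{N_y} (y_j - m_x)(y_j - m_x)^{\top} \\
W^{*} &= \arg\max_{W} \frac{\lvert W^{\top} S_y W \rvert}{\lvert W^{\top} S_x W \rvert}
\end{align}
```

Under this formulation the columns of W* are the generalized eigenvectors of S_y w = λ S_x w with the largest eigenvalues. The difference from LDA is that the negative scatter is measured around the positive mean rather than around the negatives' own class mean, so the negatives are not assumed to form a single compact cluster.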

We successfully test BDA under two schemas. The first is a traditional classification scenario using 10-fold cross-validation on four standard ground-truth corpora: LingSpam, SpamAssassin, the Phishing corpus, and a subset of the TREC 2007 spam corpus. In the second schema, we test the anticipatory properties of the statistical features with the TREC 2007 spam corpus.
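
The first schema can be illustrated with a minimal sketch of how a corpus might be partitioned for 10-fold cross-validation. The abstract states only that 10 folds are used; the stratification by class, the function name `stratified_k_fold`, the fixed seed, and the toy label list are all assumptions made here for illustration.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs, keeping class proportions
    roughly equal across folds (stratification is an assumption here)."""
    rng = random.Random(seed)

    # Group message indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle each class and deal its indices round-robin into k folds.
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    # Each fold serves once as the test set; the rest form the training set.
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, test_idx


# Hypothetical toy corpus: 30 spam and 70 legitimate ("ham") messages.
labels = ["spam"] * 30 + ["ham"] * 70
splits = list(stratified_k_fold(labels, k=10))
```

In a setup like this, a feature extraction step such as BDA would normally be fitted on the training indices of each split only and then applied to the held-out fold, so that the test messages do not influence the learned transformation.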

The contribution of this work is the evidence that BDA offers better discriminative features for email filtering, gives stable classification results regardless of the number of features chosen, and yields features that robustly retain their discriminative value over time.

Author information

Authors and Affiliations

ITESM, Eugenio Garza Sada 2501, Monterrey, NL, 64849, Mexico
Juan Carlos Gomez
Katholieke Universiteit Leuven, Celestijnenlaan 200A, B-3001, Heverlee, Belgium
Marie-Francine Moens

Editor information

Editors and Affiliations

School of Engineering, The Parade, Cardiff University, CF24 3AA, Cardiff, UK
Rossitza Setchi
Dept. of Computer Science and Software Engineering, Buckingham Building, Lion Terrace, University of Portsmouth, PO1 3HE, Portsmouth, UK
Ivan Jordanov
KES International, 145-157, St. John Street, EC1V 4PY, London, UK
Robert J. Howlett
School of Electrical and Information Engineering, University of South Australia, Mawson Lakes Campus, Adelaide, 5095, SA, Australia
Lakhmi C. Jain

About this paper

Cite this paper

Gomez, J.C., Moens, M.-F. (2010). Using Biased Discriminant Analysis for Email Filtering. In: Setchi, R., Jordanov, I., Howlett, R.J., Jain, L.C. (eds) Knowledge-Based and Intelligent Information and Engineering Systems. KES 2010. Lecture Notes in Computer Science, vol 6276. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15387-7_60

DOI: https://doi.org/10.1007/978-3-642-15387-7_60
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15386-0
Online ISBN: 978-3-642-15387-7
eBook Packages: Computer Science, Computer Science (R0)
