Using Biased Discriminant Analysis for Email Filtering

  • Conference paper
Knowledge-Based and Intelligent Information and Engineering Systems (KES 2010)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6276)

Abstract

This paper reports on email filtering based on content features. We test the validity of a novel statistical feature extraction method, which relies on dimensionality reduction to retain the most informative and discriminative features from messages. The approach, named Biased Discriminant Analysis (BDA), aims at finding a feature space transformation that closely clusters positive examples while pushing away the negative ones. The method extends Linear Discriminant Analysis (LDA) with a different transformation that improves the separation between classes, and it has not previously been applied to text mining tasks.
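To make the idea of the transformation concrete, the following is a minimal sketch, not the authors' implementation, which this preview does not show. It assumes the formulation of BDA commonly given in the literature: find directions that maximize the scatter of negative examples around the positive centroid relative to the scatter of the positive examples themselves. The function name `bda_transform`, the regularization term `reg`, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def bda_transform(X_pos, X_neg, n_components=1, reg=1e-3):
    """Sketch of a BDA-style projection (assumed formulation, see lead-in).

    Returns a (d, n_components) matrix W and the positive centroid.
    """
    m_pos = X_pos.mean(axis=0)
    d_pos = X_pos - m_pos          # positives around their own centroid
    d_neg = X_neg - m_pos          # negatives around the POSITIVE centroid
    S_pos = d_pos.T @ d_pos / len(X_pos)
    S_neg = d_neg.T @ d_neg / len(X_neg)

    # Regularize the positive scatter so it is invertible (assumed choice).
    S_pos_reg = S_pos + reg * np.eye(S_pos.shape[0])

    # Solve S_neg w = lambda * S_pos_reg w by whitening with a Cholesky
    # factor, which turns it into an ordinary symmetric eigenproblem.
    L = np.linalg.cholesky(S_pos_reg)
    L_inv = np.linalg.inv(L)
    M = L_inv @ S_neg @ L_inv.T
    evals, evecs = np.linalg.eigh(M)

    # Keep the directions with the largest negative-to-positive scatter ratio.
    order = np.argsort(evals)[::-1][:n_components]
    W = L_inv.T @ evecs[:, order]
    return W, m_pos

# Toy data: a tight positive cluster, negatives spread on both sides of it.
rng = np.random.default_rng(0)
X_pos = rng.normal(0.0, 0.5, size=(50, 5))
X_neg = rng.normal(0.0, 0.5, size=(60, 5))
X_neg[:30, 0] += 4.0
X_neg[30:, 0] -= 4.0

W, m_pos = bda_transform(X_pos, X_neg, n_components=1)
z_pos = (X_pos - m_pos) @ W
z_neg = (X_neg - m_pos) @ W
print("mean |z| positives:", np.abs(z_pos).mean())
print("mean |z| negatives:", np.abs(z_neg).mean())
```

Unlike LDA, this sketch does not model the negatives as a single cluster with its own centroid, which is why negatives lying on opposite sides of the positives can still be pushed away by one projection.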

We successfully test BDA under two schemas. The first is a traditional classification scenario using 10-fold cross-validation on four standard ground-truth corpora: LingSpam, SpamAssassin, the Phishing corpus and a subset of the TREC 2007 spam corpus. In the second schema we test the anticipatory properties of the statistical features with the TREC 2007 spam corpus.
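The two schemas can be sketched as two ways of splitting a corpus. This is an illustrative sketch only: the abstract does not describe the exact protocol, so the second function assumes that "anticipatory" means training on earlier messages and testing on later ones. The names `k_fold_indices`, `chronological_split` and the parameter `train_fraction` are hypothetical.

```python
import random

def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]   # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, test

def chronological_split(timestamps, train_fraction=0.5):
    """Train on the earliest messages, test on the later ones (assumed reading)."""
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    cut = int(len(order) * train_fraction)
    return order[:cut], order[cut:]

splits = list(k_fold_indices(25, k=10))
train_idx, test_idx = chronological_split([5, 1, 3, 2])
print(len(splits), train_idx, test_idx)
```

In the first split every message is used for testing exactly once; in the second, no message in the test set is older than any message in the training set.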

The contribution of this work is the evidence that BDA offers better discriminative features for email filtering, gives stable classification results regardless of the number of features chosen, and yields features that robustly retain their discriminative value over time.

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Gomez, J.C., Moens, MF. (2010). Using Biased Discriminant Analysis for Email Filtering. In: Setchi, R., Jordanov, I., Howlett, R.J., Jain, L.C. (eds) Knowledge-Based and Intelligent Information and Engineering Systems. KES 2010. Lecture Notes in Computer Science, vol 6276. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15387-7_60

  • DOI: https://doi.org/10.1007/978-3-642-15387-7_60

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-15386-0

  • Online ISBN: 978-3-642-15387-7

  • eBook Packages: Computer Science, Computer Science (R0)
