Spam Filtering through Anomaly Detection

  • Conference paper
E-Business and Telecommunications (ICETE 2011)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 314)

Abstract

More than 85% of received e-mails are spam. Spam is an important computer security issue because it is used to spread other threats such as computer viruses, worms, and phishing. Classic anti-spam techniques, including simple ones such as sender blacklisting or e-mail signatures, are no longer completely reliable. Machine-learning techniques trained on statistical representations of the terms that usually appear in e-mails are widely used in the literature. However, these methods demand a time-consuming training step with labelled data. When labelled training instances are scarce, the progress of filtering systems slows down, which gives spammers an advantage. In this paper, we present the first spam filtering method based on anomaly detection, which reduces the need to label spam messages and uses only the representation of legitimate e-mails. This approach represents legitimate e-mails as word frequency vectors. An e-mail is then classified as spam or legitimate by measuring its deviation from the representation of these legitimate e-mails. The method achieves high accuracy in detecting spam while maintaining a low false positive rate, and it reduces the effort required to label spam.
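The idea in the abstract (model only legitimate e-mails as word frequency vectors, then flag messages that deviate too far from them) can be illustrated with a minimal sketch. The abstract does not state which distance measure, aggregation rule, or threshold the authors use, so the cosine distance, the mean over all legitimate vectors, the `THRESHOLD` value, the toy messages, and every function name below are assumptions made purely for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def frequency_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


def cosine_distance(a, b):
    """Return 1 - cosine similarity (assumed measure, not taken from the paper).

    An all-zero vector shares no words with the vocabulary, so it is
    treated as being at distance 1.0 from everything.
    """
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (norm_a * norm_b)


def deviation(text, vocabulary, legitimate_vectors):
    """Mean distance from the message to every known legitimate e-mail."""
    vector = frequency_vector(text, vocabulary)
    distances = [cosine_distance(vector, v) for v in legitimate_vectors]
    return sum(distances) / len(distances)


def is_spam(text, vocabulary, legitimate_vectors, threshold):
    """Flag the message as an anomaly (spam) if it deviates too much."""
    return deviation(text, vocabulary, legitimate_vectors) > threshold


# Hypothetical toy data: only legitimate messages are needed for "training".
LEGITIMATE = [
    "project meeting agenda",
    "project meeting notes",
    "meeting agenda notes",
]
VOCABULARY = sorted({word for msg in LEGITIMATE for word in tokenize(msg)})
LEGITIMATE_VECTORS = [frequency_vector(msg, VOCABULARY) for msg in LEGITIMATE]
THRESHOLD = 0.5  # assumed value; in practice it would be tuned on held-out mail

print(is_spam("agenda for the project meeting", VOCABULARY, LEGITIMATE_VECTORS, THRESHOLD))
print(is_spam("win free money now", VOCABULARY, LEGITIMATE_VECTORS, THRESHOLD))
```

On this toy data the first message stays below the assumed threshold and the second exceeds it, because it shares no words with the legitimate vocabulary. No labelled spam is used at any point, which is the property the abstract emphasises.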

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Santos, I., Laorden, C., Ugarte-Pedrero, X., Sanz, B., Bringas, P.G. (2012). Spam Filtering through Anomaly Detection. In: Obaidat, M.S., Sevillano, J.L., Filipe, J. (eds) E-Business and Telecommunications. ICETE 2011. Communications in Computer and Information Science, vol 314. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35755-8_15

  • DOI: https://doi.org/10.1007/978-3-642-35755-8_15

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-35754-1

  • Online ISBN: 978-3-642-35755-8

  • eBook Packages: Computer Science, Computer Science (R0)
