Spam Filtering through Anomaly Detection

Santos, Igor; Laorden, Carlos; Ugarte-Pedrero, Xabier; Sanz, Borja; Bringas, Pablo G.

doi:10.1007/978-3-642-35755-8_15

Part of the book series: Communications in Computer and Information Science (CCIS, volume 314)

Included in the following conference series:

International Conference on E-Business and Telecommunications

Abstract

More than 85% of received e-mails are spam. Spam is an important computer security issue because it is used to spread other threats such as computer viruses, worms and phishing. Classic anti-spam techniques, including simple ones such as sender blacklisting or e-mail signatures, are no longer completely reliable. Machine-learning techniques trained on statistical representations of the terms that usually appear in e-mails are widely used in the literature. However, these methods demand a time-consuming training step with labelled data. When labelled training instances are scarce, the progress of filtering systems slows down, which gives spammers an advantage. In this paper, we present the first spam filtering method based on anomaly detection; it reduces the need to label spam messages and uses only the representation of legitimate e-mails. The approach represents legitimate e-mails as word frequency vectors. An e-mail is then classified as spam or legitimate by measuring its deviation from the representation of these legitimate e-mails. The method achieves high accuracy in detecting spam while maintaining a low false positive rate, and it reduces the effort required to label spam.
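
The idea described in the abstract (model only legitimate e-mails as word frequency vectors, then flag messages that deviate too much from them) can be illustrated with a minimal sketch. The abstract does not state which distance measure, combination rule or threshold the authors use, so the Euclidean distance, the mean over all legitimate vectors, the value of `THRESHOLD` and the toy messages below are all assumptions chosen for illustration only; every function name is hypothetical.

```python
import math
from collections import Counter


def term_frequencies(text, vocabulary):
    """Relative frequency of each vocabulary word in the message."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero on empty messages
    return [counts[word] / total for word in vocabulary]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def deviation(vector, legitimate_vectors):
    """Mean distance from a message vector to the legitimate vectors."""
    distances = [euclidean(vector, v) for v in legitimate_vectors]
    return sum(distances) / len(distances)


def is_spam(text, vocabulary, legitimate_vectors, threshold):
    """Flag a message whose deviation from legitimate mail exceeds the threshold."""
    vector = term_frequencies(text, vocabulary)
    return deviation(vector, legitimate_vectors) > threshold


# Toy corpus of legitimate messages only; no spam is needed to build the model.
LEGITIMATE = [
    "meeting on monday",
    "project report on monday",
    "meeting about project report",
]
VOCABULARY = sorted({word for mail in LEGITIMATE for word in mail.split()})
LEGITIMATE_VECTORS = [term_frequencies(mail, VOCABULARY) for mail in LEGITIMATE]
THRESHOLD = 0.45  # illustrative value, not taken from the paper

spam_flag = is_spam("win free money now", VOCABULARY, LEGITIMATE_VECTORS, THRESHOLD)
ham_flag = is_spam("report on monday meeting", VOCABULARY, LEGITIMATE_VECTORS, THRESHOLD)
```

In this sketch the message that shares no words with the legitimate corpus lies farther from the legitimate vectors than the message built from familiar words, so only the former exceeds the threshold. In practice the threshold would have to be tuned on held-out data to balance detection against false positives.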

Author information

Authors and Affiliations

S3Lab, DeustoTech - Computing, Deusto Institute of Technology, University of Deusto, Avenida de las Universidades 24, 48007, Bilbao, Spain
Igor Santos, Carlos Laorden, Xabier Ugarte-Pedrero, Borja Sanz & Pablo G. Bringas

Editor information

Editors and Affiliations

Department of Computer Science, Monmouth University, 07764, West Long Branch, NJ, U.S.A.
Mohammad S. Obaidat
University of Seville, C/S. Fernando, 4, C.P. 41004, Sevilla, Spain
José L. Sevillano
Department of Systems and Informatics, Polytechnic Institute of Setúbal – INSTICC, Rua do Vale de Chaves - Estefanilha, 2910-761, Setúbal, Portugal
Joaquim Filipe

Cite this paper

Santos, I., Laorden, C., Ugarte-Pedrero, X., Sanz, B., Bringas, P.G. (2012). Spam Filtering through Anomaly Detection. In: Obaidat, M.S., Sevillano, J.L., Filipe, J. (eds) E-Business and Telecommunications. ICETE 2011. Communications in Computer and Information Science, vol 314. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35755-8_15

DOI: https://doi.org/10.1007/978-3-642-35755-8_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35754-1
Online ISBN: 978-3-642-35755-8
eBook Packages: Computer Science, Computer Science (R0)
